Understanding The
Linux Virtual Memory Manager
Mel Gorman
July 9, 2007
Preface
Linux is developed with a stronger practical emphasis than a theoretical one. When
new algorithms or changes to existing implementations are suggested, it is common
to request code to match the argument. Many of the algorithms used in the Virtual
Memory (VM) system were designed by theorists but the implementations have now
diverged from the theory considerably.
In part, Linux does follow the traditional
development cycle of design to implementation but it is more common for changes
to be made in reaction to how the system behaved in the real world and to intuitive
decisions by developers.
This means that the VM performs well in practice but there is very little VM
specific documentation available except for a few incomplete overviews on a small
number of websites, except the web site containing an earlier draft of this book
of course! This has led to the situation where the VM is fully understood only
by a small number of core developers. New developers looking for information on
how it functions are generally told to read the source and little or no information is
available on the theoretical basis for the implementation. This requires that even a
casual observer invest a large amount of time to read the code and study the field
of Memory Management.
This book gives a detailed tour of the Linux VM as implemented in 2.4.22
and gives a solid introduction of what to expect in 2.6. As well as discussing the
implementation, the theory that it is based on will also be introduced. This is not
intended to be a memory management theory book but it is often much simpler to
understand why the VM is implemented in a particular fashion if the underlying
basis is known in advance.

To complement the description, the appendix includes a detailed code commentary
on a significant percentage of the VM. This should drastically reduce the amount
of time a developer or researcher needs to invest in understanding what is happening
inside the Linux VM. VM implementations tend to follow similar code patterns
even between major versions. This means that with a solid understanding of the 2.4
VM, the later 2.5 development VMs and the final 2.6 release will be decipherable in
a number of weeks.
The Intended Audience
Anyone interested in how the VM, a core kernel subsystem, works will find answers
to many of their questions in this book. The VM, more than any other subsystem,
affects the overall performance of the operating system. It is also one of the most
poorly understood and badly documented subsystems in Linux, partially because
there is, quite literally, so much of it. It is very difficult to isolate and understand
individual parts of the code without first having a strong conceptual model of the
whole VM, so this book intends to give a detailed description of what to expect
before going to the source.

This material should be of prime interest to new developers interested in adapting
the VM to their needs and to readers who simply would like to know how the VM
works. It also will benefit other subsystem developers who want to get the most from
the VM when they interact with it and operating systems researchers looking for
details on how memory management is implemented in a modern operating system.
For others, who are just curious to learn more about a subsystem that is the focus of
so much discussion, they will find an easy to read description of the VM functionality
that covers all the details without the need to plough through source code.

However, it is assumed that the reader has read at least one general operating
system book or one general Linux kernel orientated book and has a general knowledge
of C before tackling this book. While every effort is made to make the material
approachable, some prior knowledge of general operating systems is assumed.
Book Overview
In Chapter 1, we go into detail on how the source code may be managed and
deciphered. Three tools will be introduced that are used for the analysis, easy browsing
and management of code. The main tools are the Linux Cross Referencing (LXR)
tool, which allows source code to be browsed as a web page, and CodeViz for
generating call graphs, which was developed while researching this book. The last tool,
PatchSet, is for managing kernels and the application of patches. Applying patches
manually can be time consuming and the use of version control software such as
CVS (http://www.cvshome.org/ ) or BitKeeper (http://www.bitmover.com ) is not
always an option. With this tool, a simple specification file determines what source
to use, what patches to apply and what kernel configuration to use.

In the subsequent chapters, each part of the Linux VM implementation will be
discussed in detail, such as how memory is described in an architecture independent
manner, how processes manage their memory, how the specific allocators work and
so on. Each will refer to the papers that most closely describe the behaviour of Linux
as well as covering in depth the implementation, the functions used and their call
graphs so the reader will have a clear view of how the code is structured. At the
end of each chapter, there will be a What's New section which introduces what to
expect in the 2.6 VM.

The appendices are a code commentary of a significant percentage of the VM. It
gives a line by line description of some of the more complex aspects of the VM. The
style of the VM tends to be reasonably consistent, even between major releases of
the kernel so an in-depth understanding of the 2.4 VM will be an invaluable aid to
understanding the 2.6 kernel when it is released.
What's New in 2.6
At the time of writing, 2.6.0-test4 has just been released so 2.6.0-final is due
any month now which means December 2003 or early 2004. Fortunately the 2.6
VM, in most ways, is still quite recognisable in comparison to 2.4. However, there
is some new material and concepts in 2.6 and it would be a pity to ignore them so,
to address this, the What's New in 2.6 sections were added. To some extent, these
sections presume you have read the rest of the book so only glance at them during
the first reading. If you decide to start reading 2.5 and 2.6 VM code, the basic
description of what to expect from the What's New sections should greatly aid
your understanding. It is important to note that the sections are based on the
2.6.0-test4 kernel which should not change significantly before 2.6. As
they are still subject to change though, you should still treat the What's New
sections as guidelines rather than definite facts.
Companion CD
A companion CD is included with this book which is intended to be used on systems
with GNU/Linux installed. Mount the CD on /cdrom as follows:

root@joshua:/$ mount /dev/cdrom /cdrom -o exec

A copy of Apache 1.3.27 (http://www.apache.org/ ) has been built and configured
to run but it requires the CD be mounted on /cdrom/. To start it, run the
script /cdrom/start_server. If there are no errors, the output should look like:

mel@joshua:~$ /cdrom/start_server
Starting CodeViz Server: done
Starting Apache Server: done
The URL to access is http://localhost:10080/

If the server starts successfully, point your browser to http://localhost:10080 to
avail of the CD's web services. Some features included with the CD are:
• A web server is available which is started by /cdrom/start_server. After
starting it, the URL to access is http://localhost:10080. It has been
tested with Red Hat 7.3 and Debian Woody;

• The whole book is included in HTML, PDF and plain text formats from
/cdrom/docs. It includes a searchable index for functions that have a commentary
available. If a function is searched for that does not have a commentary,
the browser will be automatically redirected to LXR;

• A web browsable copy of the Linux 2.4.22 source is available courtesy of LXR;

• Generate call graphs with an online version of the CodeViz tool;

• The VM Regress, CodeViz and patchset packages which are discussed in
Chapter 1 are available in /cdrom/software. gcc-3.0.4 is also provided as it
is required for building CodeViz.

To shutdown the server, run the script /cdrom/stop_server and the CD may
then be unmounted.
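As a minimal sketch of that shutdown sequence, assuming the server was started as shown above and that root privileges are available for the umount:

mel@joshua:~$ /cdrom/stop_server
root@joshua:/$ umount /cdrom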
Typographic Conventions
The conventions used in this document are simple. New concepts that are introduced
as well as URLs are in italicised font. Binaries and package names are in bold.
Structures, field names, compile time defines and variables are in a constant-width
font. At times when talking about a field in a structure, both the structure and field
name will be included like page→list for example. Filenames are in a constant-width
font but include files have angle brackets around them like <linux/mm.h>
and may be found in the include/ directory of the kernel source.
Acknowledgments
The compilation of this book was not a trivial task. This book was researched and
developed in the open and it would be remiss of me not to mention some of the
people who helped me at various intervals. If there is anyone I missed, I apologise
now.
First, I would like to thank John O'Gorman who tragically passed away while
the material for this book was being researched. It was his experience and guidance
that largely inspired the format and quality of this book.
Secondly, I would like to thank Mark L. Taub from Prentice Hall PTR for giving
me the opportunity to publish this book. It has been a rewarding experience and it
made trawling through all the code worthwhile. Massive thanks go to my reviewers
who provided clear and detailed feedback long after I thought I had finished writing.
Finally, on the publisher's front, I would like to thank Bruce Perens for allowing me to
publish under the Bruce Perens' Open Book Series (http://www.perens.com/Books ).
With the technical research, a number of people provided invaluable insight.
Abhishek Nayani was a source of encouragement and enthusiasm early in the research.
Ingo Oeser kindly provided invaluable assistance early on with a detailed
explanation of how data is copied from userspace to kernel space, including some
valuable historical context. He also kindly offered to help me if I felt I ever got
lost in the twisty maze of kernel code. Scott Kaplan made numerous corrections to
a number of systems from non-contiguous memory allocation to page replacement
policy. Jonathon Corbet provided the most detailed account of the history of
kernel development with the kernel page he writes for Linux Weekly News. Zack
Brown, the chief behind Kernel Traffic, is the sole reason I did not drown in kernel
related mail. IBM, as part of the Equinox Project, provided an xSeries 350 which
was invaluable for running my own test kernels on machines larger than what I
previously had access to. Finally, Patrick Healy was crucial to ensuring that this book
was consistent and approachable to people who are familiar, but not experts, on
Linux or memory management.

A number of people helped with smaller technical issues and general inconsistencies
where material was not covered in sufficient depth. They are Muli Ben-Yehuda,
Parag Sharma, Matthew Dobson, Roger Luethi, Brian Lowe and Scott Crosby. All of
them sent corrections and queries on different parts of the document which ensured
too much prior knowledge was not assumed.
Carl Spalletta sent a number of queries and corrections to every aspect of the
book in its earlier online form. Steve Greenland sent a large number of grammar
corrections. Philipp Marek went above and beyond being helpful, sending over 90
separate corrections and queries on various aspects. Long after I thought I was
finished, Aris Sotiropoulos sent a large number of small corrections and suggestions.
The last person, whose name I cannot remember but is an editor for a magazine,
sent me over 140 corrections against an early version of the document. You know
who you are, thanks.

Eleven people sent a few corrections which, though small, were still missed by several
of my own checks. They are Marek Januszewski, Amit Shah, Adrian Stanciu, Andy
Isaacson, Jean Francois Martinez, Glen Kaukola, Wolfgang Oertl, Michael Babcock,
Kirk True, Chuck Luciano and David Wilson.
On the development of VM Regress, there were nine people who helped me keep
it together. Danny Faught and Paul Larson both sent me a number of bug reports
and helped ensure it worked with a variety of different kernels. Cliff White, from
the OSDL labs, ensured that VM Regress would have a wider application than my
own test box. Dave Olien, also associated with the OSDL labs, was responsible
for updating VM Regress to work with 2.5.64 and later kernels. Albert Cahalan
sent all the information I needed to make it function against later proc utilities.
Finally, Andrew Morton, Rik van Riel and Scott Kaplan all provided insight on
what direction the tool should be developed to be both valid and useful.

The last long list are people who sent me encouragement and thanks at various
intervals. They are Martin Bligh, Paul Rolland, Mohamed Ghouse, Samuel Chessman,
Ersin Er, Mark Hoy, Michael Martin, Martin Gallwey, Ravi Parimi, Daniel
Codt, Adnan Shafi, Xiong Quanren, Dave Airlie, Der Herr Hofrat, Ida Hallgren,
Manu Anand, Eugene Teo, Diego Calleja and Ed Cashin. Thanks, the encouragement
was heartening.
In conclusion, I would like to thank a few people without whom I would not
have completed this. I would like to thank my parents who kept me going long
after I should have been earning enough money to support myself. I would like to
thank my girlfriend Karen, who patiently listened to rants, tech babble and angsting
over the book and made sure I was the person with the best toys. Kudos to friends
who dragged me away from the computer periodically and kept me relatively sane,
including Daren who is cooking me dinner as I write this. Finally, I would like to
thank the thousands of hackers that have contributed to GNU, the Linux kernel
and other Free Software projects over the years, without whom I would not have an
excellent system to write about. It was an inspiration to me to see such dedication
when I first started programming on my own PC 6 years ago after finally figuring
out that Linux was not an application for Windows used for reading email.
Contents

Code Commentary Contents
List of Figures
List of Tables

1 Introduction
1.1 Getting Started
1.2 Managing the Source
1.3 Browsing the Code
1.4 Reading the Code
1.5 Submitting Patches

2 Describing Physical Memory
2.1 Nodes
2.2 Zones
2.3 Zone Initialisation
2.4 Pages
2.5 High Memory
2.6 What's New In 2.6

3 Page Table Management
3.1 Describing the Page Directory
3.2 Describing a Page Table Entry
3.3 Using Page Table Entries
3.4 Translating and Setting Page Table Entries
3.5 Allocating and Freeing Page Tables
3.6 Kernel Page Tables
3.7 Mapping addresses to a struct page
3.8 Translation Lookaside Buffer (TLB)
3.9 Level 1 CPU Cache Management
3.10 What's New In 2.6

4 Process Address Space
4.1 Linear Address Space
4.2 Managing the Address Space
4.3 Process Address Space Descriptor
4.4 Memory Regions
4.5 Exception Handling
4.6 Page Faulting
4.7 Copying To/From Userspace
4.8 What's New in 2.6

5 Boot Memory Allocator
5.1 Representing the Boot Map
5.2 Initialising the Boot Memory Allocator
5.3 Allocating Memory
5.4 Freeing Memory
5.5 Retiring the Boot Memory Allocator
5.6 What's New in 2.6

6 Physical Page Allocation
6.1 Managing Free Blocks
6.2 Allocating Pages
6.3 Free Pages
6.4 Get Free Page (GFP) Flags
6.5 Avoiding Fragmentation
6.6 What's New In 2.6

7 Non-Contiguous Memory Allocation
7.1 Describing Virtual Memory Areas
7.2 Allocating A Non-Contiguous Area
7.3 Freeing A Non-Contiguous Area
7.4 What's New in 2.6

8 Slab Allocator
8.1 Caches
8.2 Slabs
8.3 Objects
8.4 Sizes Cache
8.5 Per-CPU Object Cache
8.6 Slab Allocator Initialisation
8.7 Interfacing with the Buddy Allocator
8.8 What's New in 2.6

9 High Memory Management
9.1 Managing the PKMap Address Space
9.2 Mapping High Memory Pages
9.3 Mapping High Memory Pages Atomically
9.4 Bounce Buffers
9.5 Emergency Pools
9.6 What's New in 2.6

10 Page Frame Reclamation
10.1 Page Replacement Policy
10.2 Page Cache
10.3 LRU Lists
10.4 Shrinking all caches
10.5 Swapping Out Process Pages
10.6 Pageout Daemon (kswapd)
10.7 What's New in 2.6

11 Swap Management
11.1 Describing the Swap Area
11.2 Mapping Page Table Entries to Swap Entries
11.3 Allocating a swap slot
11.4 Swap Cache
11.5 Reading Pages from Backing Storage
11.6 Writing Pages to Backing Storage
11.7 Reading/Writing Swap Area Blocks
11.8 Activating a Swap Area
11.9 Deactivating a Swap Area
11.10 What's New in 2.6

12 Shared Memory Virtual Filesystem
12.1 Initialising the Virtual Filesystem
12.2 Using shmem Functions
12.3 Creating Files in tmpfs
12.4 Page Faulting within a Virtual File
12.5 File Operations in tmpfs
12.6 Inode Operations in tmpfs
12.7 Setting up Shared Regions
12.8 System V IPC
12.9 What's New in 2.6

13 Out Of Memory Management
13.1 Checking Available Memory
13.2 Determining OOM Status
13.3 Selecting a Process
13.4 Killing the Selected Process
13.5 Is That It?
13.6 What's New in 2.6

14 The Final Word

A Introduction

B Describing Physical Memory
B.1 Initialising Zones
B.2 Page Operations

C Page Table Management
C.1 Page Table Initialisation
C.2 Page Table Walking

D Process Address Space
D.1 Process Memory Descriptors
D.2 Creating Memory Regions
D.3 Searching Memory Regions
D.4 Locking and Unlocking Memory Regions
D.5 Page Faulting
D.6 Page-Related Disk IO

E Boot Memory Allocator
E.1 Initialising the Boot Memory Allocator
E.2 Allocating Memory
E.3 Freeing Memory
E.4 Retiring the Boot Memory Allocator

F Physical Page Allocation
F.1 Allocating Pages
F.2 Allocation Helper Functions
F.3 Free Pages
F.4 Free Helper Functions

G Non-Contiguous Memory Allocation
G.1 Allocating A Non-Contiguous Area
G.2 Freeing A Non-Contiguous Area

H Slab Allocator
H.1 Cache Manipulation
H.2 Slabs
H.3 Objects
H.4 Sizes Cache
H.5 Per-CPU Object Cache
H.6 Slab Allocator Initialisation
H.7 Interfacing with the Buddy Allocator

I High Memory Management
I.1 Mapping High Memory Pages
I.2 Mapping High Memory Pages Atomically
I.3 Unmapping Pages
I.4 Unmapping High Memory Pages Atomically
I.5 Bounce Buffers
I.6 Emergency Pools

J Page Frame Reclamation
J.1 Page Cache Operations
J.2 LRU List Operations
J.3 Refilling inactive_list
J.4 Reclaiming Pages from the LRU Lists
J.5 Shrinking all caches
J.6 Swapping Out Process Pages
J.7 Page Swap Daemon

K Swap Management
K.1 Scanning for Free Entries
K.2 Swap Cache
K.3 Swap Area IO
K.4 Activating a Swap Area
K.5 Deactivating a Swap Area

L Shared Memory Virtual Filesystem
L.1 Initialising shmfs
L.2 Creating Files in tmpfs
L.3 File Operations in tmpfs
L.4 Inode Operations in tmpfs
L.5 Page Faulting within a Virtual File
L.6 Swap Space Interaction
L.7 Setting up Shared Regions
L.8 System V IPC

M Out of Memory Management
M.1 Determining Available Memory
M.2 Detecting and Recovering from OOM

Reference
Bibliography
Code Commentary Index
Index
List of Figures

1.1 Example Patch
2.1 Relationship Between Nodes, Zones and Pages
2.2 Zone Watermarks
2.3 Call Graph: setup_memory()
2.4 Sleeping On a Locked Page
2.5 Call Graph: free_area_init()
3.1 Page Table Layout
3.2 Linear Address Bit Size Macros
3.3 Linear Address Size and Mask Macros
3.4 Call Graph: paging_init()
4.1 Kernel Address Space
4.2 Data Structures related to the Address Space
4.3 Memory Region Flags
4.4 Call Graph: sys_mmap2()
4.5 Call Graph: get_unmapped_area()
4.6 Call Graph: insert_vm_struct()
4.7 Call Graph: sys_mremap()
4.8 Call Graph: move_vma()
4.9 Call Graph: move_page_tables()
4.10 Call Graph: sys_mlock()
4.11 Call Graph: do_munmap()
4.12 Call Graph: do_page_fault()
4.13 do_page_fault() Flow Diagram
4.14 Call Graph: handle_mm_fault()
4.15 Call Graph: do_no_page()
4.16 Call Graph: do_swap_page()
4.17 Call Graph: do_wp_page()
5.1 Call Graph: alloc_bootmem()
5.2 Call Graph: mem_init()
5.3 Initialising mem_map and the Main Physical Page Allocator
6.1 Free page block management
6.2 Allocating physical pages
6.3 Call Graph: alloc_pages()
6.4 Call Graph: __free_pages()
7.1 vmalloc Address Space
7.2 Call Graph: vmalloc()
7.3 Relationship between vmalloc(), alloc_page() and Page Faulting
7.4 Call Graph: vfree()
8.1 Layout of the Slab Allocator
8.2 Slab page containing Objects Aligned to L1 CPU Cache
8.3 Call Graph: kmem_cache_create()
8.4 Call Graph: kmem_cache_reap()
8.5 Call Graph: kmem_cache_shrink()
8.6 Call Graph: __kmem_cache_shrink()
8.7 Call Graph: kmem_cache_destroy()
8.8 Page to Cache and Slab Relationship
8.9 Slab With Descriptor On-Slab
8.10 Slab With Descriptor Off-Slab
8.11 Call Graph: kmem_cache_grow()
8.12 Initialised kmem_bufctl_t Array
8.13 Call Graph: kmem_slab_destroy()
8.14 Call Graph: kmem_cache_alloc()
8.15 Call Graph: kmem_cache_free()
8.16 Call Graph: kmalloc()
8.17 Call Graph: kfree()
9.1 Call Graph: kmap()
9.2 Call Graph: kunmap()
9.3 Call Graph: create_bounce()
9.4 Call Graph: bounce_end_io_read/write()
9.5 Acquiring Pages from Emergency Pools
10.1 Page Cache LRU Lists
10.2 Call Graph: generic_file_read()
10.3 Call Graph: add_to_page_cache()
10.4 Call Graph: shrink_caches()
10.5 Call Graph: swap_out()
10.6 Call Graph: kswapd()
11.1 Storing Swap Entry Information in swp_entry_t
11.2 Call Graph: get_swap_page()
11.3 Call Graph: add_to_swap_cache()
11.4 Adding a Page to the Swap Cache
11.5 Call Graph: read_swap_cache_async()
11.6 Call Graph: sys_writepage()
12.1 Call Graph: init_tmpfs()
12.2 Call Graph: shmem_create()
12.3 Call Graph: shmem_nopage()
12.4 Traversing Indirect Blocks in a Virtual File
12.5 Call Graph: shmem_zero_setup()
12.6 Call Graph: sys_shmget()
13.1 Call Graph: out_of_memory()
14.1 Broad Overview on how VM Sub-Systems Interact
D.1 Call Graph: mmput()
E.1 Call Graph: free_bootmem()
H.1 Call Graph: enable_all_cpucaches()
List of Tables

1.1 Kernel size as an indicator of complexity
2.1 Flags Describing Page Status
2.2 Macros For Testing, Setting and Clearing page→flags Status Bits
3.1 Page Table Entry Protection and Status Bits
3.2 Translation Lookaside Buffer Flush API
3.3 Translation Lookaside Buffer Flush API (cont)
3.4 Cache and TLB Flush Ordering
3.5 CPU Cache Flush API
3.6 CPU D-Cache and I-Cache Flush API
4.1 System Calls Related to Memory Regions
4.2 Functions related to memory region descriptors
4.3 Memory Region VMA API
4.4 Reasons For Page Faulting
4.5 Accessing Process Address Space API
5.1 Boot Memory Allocator API for UMA Architectures
5.2 Boot Memory Allocator API for NUMA Architectures
6.1 Physical Pages Allocation API
6.2 Physical Pages Free API
6.3 Low Level GFP Flags Affecting Zone Allocation
6.4 Low Level GFP Flags Affecting Allocator behaviour
6.5 Low Level GFP Flag Combinations For High Level Use
6.6 High Level GFP Flags Affecting Allocator Behaviour
6.7 Process Flags Affecting Allocator behaviour
7.1 Non-Contiguous Memory Allocation API
7.2 Non-Contiguous Memory Free API
8.1 Slab Allocator API for caches
8.2 Internal cache static flags
8.3 Cache static flags set by caller
8.4 Cache static debug flags
8.5 Cache Allocation Flags
8.6 Cache Constructor Flags
9.1 High Memory Mapping API
9.2 High Memory Unmapping API
10.1 Page Cache API
10.2 LRU List API
11.1 Swap Cache API
Chapter 1
Introduction
Linux is a relatively new operating system that has begun to enjoy a lot of attention
from the business, academic and free software worlds. As the operating system
matures, its feature set, capabilities and performance grow but so, out of necessity,
does its size and complexity. Table 1.1 shows the size of the kernel
source code in bytes and lines of code of the mm/ part of the kernel tree. This does
not include the machine dependent code or any of the buffer management code and
does not even pretend to be an accurate metric for complexity but still serves as a
small indicator.
Version      Release Date            Total Size   Size of mm/   Line count
1.0          March 13th, 1992        5.9MiB       96KiB         3109
1.2.13       February 8th, 1995      11MiB        136KiB        4531
2.0.39       January 9th, 2001       35MiB        204KiB        6792
2.2.22       September 16th, 2002    93MiB        292KiB        9554
2.4.22       August 25th, 2003       181MiB       436KiB        15724
2.6.0-test4  August 22nd, 2003       261MiB       604KiB        21714

Table 1.1: Kernel size as an indicator of complexity
As is the habit of open source developers in general, new developers asking
questions are sometimes told to refer directly to the source with the polite
acronym RTFS¹ or else are referred to the kernel newbies mailing list
(http://www.kernelnewbies.org ). With the Linux Virtual Memory (VM) manager,
this used to be a suitable response as the time required to understand the VM could
be measured in weeks and the books available devoted enough time to the memory
management chapters to make the relatively small amount of code easy to navigate.

The books that describe the operating system, such as Understanding the Linux
Kernel [BC00] [BC03], tend to cover the entire kernel rather than one topic, with the
notable exception of device drivers [RC01]. These books, particularly Understanding
the Linux Kernel, provide invaluable insight into kernel internals but they miss the
details which are specific to the VM and not of general interest. For example, it
is detailed in this book why ZONE_NORMAL is exactly 896MiB and exactly how per-cpu
caches are implemented. Other aspects of the VM, such as the boot memory
allocator and the virtual memory filesystem, which are not of general kernel interest,
are also covered by this book.

Increasingly, to get a comprehensive view on how the kernel functions, one is
required to read through the source code line by line. This book tackles the VM
specifically so that this investment of time to understand it will be measured in
weeks and not months. The details which are missed by the main part of the book
will be caught by the code commentary.

In this chapter, there will be an informal introduction to the basics of acquiring
information on an open source project and some methods for managing, browsing
and comprehending the code. If you do not intend to be reading the actual source,
you may skip to Chapter 2.

¹ Read The Flaming Source. It doesn't really stand for Flaming but there could be children
watching.
1.1 Getting Started
One of the largest initial obstacles to understanding code is deciding where to start
and how to easily manage, browse and get an overview of the overall code structure.
If requested on mailing lists, people will provide some suggestions on how to proceed
but a comprehensive methodology is rarely offered aside from suggestions to keep
reading the source until it makes sense. In the following sections, some useful rules
of thumb for open source code comprehension will be introduced, specifically on
how they may be applied to the kernel.
1.1.1 Configuration and Building

With any open source project, the first step is to download the source and read
the installation documentation. By convention, the source will have a README or
INSTALL file at the top-level of the source tree [FF02]. In fact, some automated
build tools such as automake require the install file to exist. These files will contain
instructions for configuring and installing the package or will give a reference to
where more information may be found. Linux is no exception as it includes a README
which describes how the kernel may be configured and built.

The second step is to build the software. In earlier days, the requirement for
many projects was to edit the Makefile by hand but this is rarely the case now.
Free software usually uses at least autoconf² to automate testing of the build
environment and automake³ to simplify the creation of Makefiles so building is
often as simple as:

mel@joshua: project $ ./configure && make

² http://www.gnu.org/software/autoconf/
³ http://www.gnu.org/software/automake/
Some older projects, such as the Linux kernel, use their own configuration tools
and some large projects such as the Apache webserver have numerous configuration
options but usually the configure script is the starting point. In the case of the
kernel, the configuration is handled by the Makefiles and supporting tools. The
simplest means of configuration is to:

mel@joshua: linux-2.4.22 $ make config

This asks a long series of questions on what type of kernel should be built. Once
all the questions have been answered, compiling the kernel is simply:

mel@joshua: linux-2.4.22 $ make bzImage && make modules

A comprehensive guide on configuring and compiling a kernel is available with
the Kernel HOWTO⁴ and will not be covered in detail with this book. For now, we
will presume you have one fully built kernel and it is time to begin figuring out how
the new kernel actually works.

⁴ http://www.tldp.org/HOWTO/Kernel-HOWTO/index.html
1.1.2 Sources of Information
Open Source projects will usually have a home page, especially since free project
hosting sites such as http://www.sourceforge.net are available. The home site will
contain links to available documentation and instructions on how to join the mailing
list, if one is available. Some sort of documentation will always exist, even if it is
as minimal as a simple README file, so read whatever is available. If the project
is old and reasonably large, the web site will probably feature a Frequently Asked
Questions (FAQ).

Next, join the development mailing list and lurk, which means to subscribe to a
mailing list and read it without posting. Mailing lists are the preferred form of
developer communication followed by, to a lesser extent, Internet Relay Chat (IRC) and
online newsgroups, commonly referred to as UseNet. As mailing lists often contain
discussions on implementation details, it is important to read at least the previous
months' archives to get a feel for the developer community and current activity. The
mailing list archives should be the first place to search if you have a question or
query on the implementation that is not covered by available documentation. If you
have a question to ask the developers, take time to research the questions and ask
it the Right Way [RM01]. While there are people who will answer obvious questions,
it will not do your credibility any favours to be constantly asking questions
that were answered a week previously or are clearly documented.

Now, how does all this apply to Linux? First, the documentation. There is a
README at the top of the source tree and a wealth of information is available in the
Documentation/ directory. There is also a number of books on UNIX design [Vah96],
Linux specifically [BC00] and of course this book to explain what to expect in the
code.
One of the best online sources of information available on kernel development is
the Kernel Page in the weekly edition of Linux Weekly News (http://www.lwn.net).
It also reports on a wide range of Linux related topics and is worth a regular read.
The kernel does not have a home web site as such but the closest equivalent is
http://www.kernelnewbies.org which is a vast source of information on the kernel
that is invaluable to new and experienced people alike. There is a FAQ available for the
Linux Kernel Mailing List (LKML) at http://www.tux.org/lkml/
that covers questions ranging from the kernel development process to how to join
the list itself. The list is archived at many sites but a common choice to reference
is http://marc.theaimsgroup.com/?l=linux-kernel. Be aware that the mailing list is
a very high volume list which can be a very daunting read but a weekly summary is
provided by the Kernel Traffic site at http://kt.zork.net/kernel-traffic/.

The sites and sources mentioned so far contain general kernel information but
there are memory management specific sources. There is a Linux-MM web site
at http://www.linux-mm.org which contains links to memory management specific
documentation and a linux-mm mailing list. The list is relatively light in comparison
to the main list and is archived at http://mail.nl.linux.org/linux-mm/.

The last site to consult is the Kernel Trap site at http://www.kerneltrap.org.
The site contains many useful articles on kernels in general. It is not specific to
Linux but it does contain many Linux related articles and interviews with kernel
developers.

As is clear, there is a vast amount of information that is available that may be
consulted before resorting to the code. With enough experience, it will eventually
be faster to consult the source directly but when getting started, check other sources
of information first.
1.2 Managing the Source
The mainline or stock kernel is principally distributed as a compressed tape archive
(.tar.bz2) file which is available from your nearest kernel source repository, in Ireland's
case ftp://ftp.ie.kernel.org/. The stock kernel is always considered to be the one
released by the tree maintainer. For example, at time of writing, the stock kernels
for 2.2.x are those released by Alan Cox⁵, for 2.4.x by Marcelo Tosatti and for 2.5.x by
Linus Torvalds. At each release, the full tar file is available as well as a smaller patch
which contains the differences between the two releases. Patching is the preferred
method of upgrading because of bandwidth considerations. Contributions made to
the kernel are almost always in the form of patches which are unified diffs generated
by the GNU tool diff.

⁵ Last minute update, Alan is just after announcing he was going on sabbatical and will no
longer maintain the 2.2.x tree. There is no maintainer at the moment.

Why patches

Sending patches to the mailing list initially sounds clumsy but it is
remarkably efficient in the kernel development environment. The principal advantage
of patches is that it is much easier to read what changes have been made than to
compare two full versions of a file side by side. A developer familiar with the code
can easily see what impact the changes will have and if it should be merged. In
addition, it is very easy to quote the email that includes the patch and request more
information about it.
Subtrees

At various intervals, individual influential developers may have their
own version of the kernel distributed as a large patch to the main tree. These
subtrees generally contain features or cleanups which have not been merged to the
mainstream yet or are still being tested. Two notable subtrees are the -rmap tree
maintained by Rik Van Riel, a long time influential VM developer, and the -mm tree
maintained by Andrew Morton, the current maintainer of the stock development
VM. The -rmap tree contains a large set of features that for various reasons are not
available in the mainline. It is heavily influenced by the FreeBSD VM and has a
number of significant differences to the stock VM. The -mm tree is quite different to
-rmap in that it is a testing tree with patches that are being tested before merging
into the stock kernel.
BitKeeper

In more recent times, some developers have started using a source code
control system called BitKeeper (http://www.bitmover.com ), a proprietary version
control system that was designed with Linux as the principal consideration.
BitKeeper allows developers to have their own distributed version of the tree and
other users may pull sets of patches called changesets from each others' trees. This
distributed nature is a very important distinction from traditional version control
software which depends on a central server.

BitKeeper allows comments to be associated with each patch which are displayed
as part of the release information for each kernel. For Linux, this means that the
email that originally submitted the patch is preserved making the progress of kernel
development and the meaning of different patches a lot more transparent. On release,
a list of the patch titles from each developer is announced as well as a detailed list
of all patches included.

As BitKeeper is a proprietary product, email and patches are still considered the
only method for generating discussion on code changes. In fact, some patches will
not be considered for acceptance unless there is first some discussion on the main
mailing list as code quality is considered to be directly related to the amount of peer
review [Ray02]. As the BitKeeper maintained source tree is exported in formats
accessible to open source tools like CVS, patches are still the preferred means of
discussion. It means that no developer is required to use BitKeeper for making
contributions to the kernel but the tool is still something that developers should be
aware of.
1.2.1 Diff and Patch

The two tools for creating and applying patches are diff and patch, both of which
are GNU utilities available from the GNU website⁶. diff is used to generate patches
and patch is used to apply them. While the tools have numerous options, there is
a preferred usage.

Patches generated with diff should always be unified diffs, include the C function
that the change affects and be generated from one directory above the kernel source
root. A unified diff includes more information than just the differences between two
lines. It begins with a two line header with the names and creation dates of the two
files that diff is comparing. After that, the diff will consist of one or more hunks.
The beginning of each hunk is marked with a line beginning with @@ which includes
the starting line in the source code and how many lines there are before and after
the hunk is applied. The hunk includes context lines which show lines above and
below the changes to aid a human reader. Each line begins with a +, - or blank. If
the mark is +, the line is added. If a -, the line is removed and a blank is to leave
the line alone as it is there just to provide context. The reasoning behind generating
from one directory above the kernel root is that it is easy to see quickly what version
the patch has been applied against and it makes the scripting of applying patches
easier if each patch is generated the same way.

⁶ http://www.gnu.org
Let us take for example a very simple change that has been made to mm/page_alloc.c
which adds a small piece of commentary. The patch is generated as follows. Note
that this command should be all on one line minus the backslashes.

mel@joshua: kernels/ $ diff -up                    \
        linux-2.4.22-clean/mm/page_alloc.c         \
        linux-2.4.22-mel/mm/page_alloc.c > example.patch
This generates a unified context diff (-u switch) between two files and places the
patch in example.patch as shown in Figure 1.1. It also displays the name of the
affected C function.

From this patch, it is clear even at a casual glance what files are affected
(page_alloc.c), what line it starts at (76) and the new lines added are clearly
marked with a +. In a patch, there may be several hunks which are marked
with a line starting with @@. Each hunk will be treated separately during patch
application.
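To make the hunk header concrete, the hunk in Figure 1.1 begins with the following line; the numbers below simply restate what that header encodes and are taken directly from the example:

@@ -76,8 +76,23 @@

The pair -76,8 means the hunk covers 8 lines of the original file beginning at line 76, while +76,23 means the corresponding region of the patched file also begins at line 76 but now spans 23 lines; the difference of 15 lines is the added commentary.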
Broadly speaking, patches come in two varieties; plain text such as the one above
which are sent to the mailing list and compressed patches that are compressed
with either gzip (.gz extension) or bzip2 (.bz2 extension). It is usually safe to
assume that patches were generated one directory above the root of the kernel source
tree. This means that while the patch is generated one directory above, it may be
applied with the option -p1 while the current directory is the kernel source tree root.
--- linux-2.4.22-clean/mm/page_alloc.c Thu Sep 4 03:53:15 2003
+++ linux-2.4.22-mel/mm/page_alloc.c Thu Sep 3 03:54:07 2003
@@ -76,8 +76,23 @@
* triggers coalescing into a block of larger size.
*
* -- wli
+ *
+ * There is a brief explanation of how a buddy algorithm works at
+ * http://www.memorymanagement.org/articles/alloc.html . A better idea
+ * is to read the explanation from a book like UNIX Internals by
+ * Uresh Vahalia
+ *
*/
+/**
+ *
+ * __free_pages_ok - Returns pages to the buddy allocator
+ * @page: The first page of the block to be freed
+ * @order: 2^order number of pages are freed
+ *
+ * This function returns the pages allocated by __alloc_pages and tries to
+ * merge buddies if possible. Do not call directly, use free_pages()
+ **/
static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
static void __free_pages_ok (struct page *page, unsigned int order)
{
Figure 1.1: Example Patch
Broadly speaking, this means a plain text patch to a clean tree can be easily
applied as follows:
mel@joshua: kernels/ $ cd linux-2.4.22-clean/
mel@joshua: linux-2.4.22-clean/ $ patch -p1 < ../example.patch
patching file mm/page_alloc.c
mel@joshua: linux-2.4.22-clean/ $
To apply a compressed patch, it is a simple extension to just decompress the
patch to standard out (stdout) first.
mel@joshua: linux-2.4.22-mel/ $ gzip -dc ../example.patch.gz | patch -p1
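A bzip2 compressed patch is handled in the same way; the following is a minimal sketch assuming the example patch had instead been compressed with bzip2 as example.patch.bz2:

mel@joshua: linux-2.4.22-mel/ $ bzip2 -dc ../example.patch.bz2 | patch -p1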
If a hunk can be applied but the line numbers are different, the hunk number
and the number of lines needed to offset will be output. These are generally safe
warnings and may be ignored. If there are slight differences in the context, it will be
applied and the level of fuzziness will be printed which should be double checked.
If a hunk fails to apply, it will be saved to filename.c.rej and the original file will
be saved to filename.c.orig and have to be applied manually.
1.2.2 Basic Source Management with PatchSet
The untarring of sources, management of patches and building of kernels is
initially interesting but quickly palls. To cut down on the tedium of patch
management, a simple tool was developed while writing this book called PatchSet
which is designed to easily manage the kernel source and patches, eliminating
a large amount of the tedium. It is fully documented and freely available from
http://www.csn.ul.ie/~mel/projects/patchset/ and on the companion CD.

Downloading

Downloading kernels and patches in itself is quite tedious and
scripts are provided to make the task simpler. First, the configuration file
etc/patchset.conf should be edited and the KERNEL_MIRROR parameter updated
for your local http://www.kernel.org/ mirror. Once that is done, use the script
download to download patches and kernel sources. A simple use of the script is as
follows
mel@joshua: patchset/ $ download 2.4.18
# Will download the 2.4.18 kernel source
mel@joshua: patchset/ $ download -p 2.4.19
# Will download a patch for 2.4.19
mel@joshua: patchset/ $ download -p -b 2.4.20
# Will download a bzip2 patch for 2.4.20
Once the relevant sources or patches have been downloaded, it is time to configure
a kernel build.
Configuring Builds

Files called set configuration files are used to specify what
kernel source tar to use, what patches to apply, what kernel configuration (generated
by make config) to use and what the resulting kernel is to be called. A sample
specification file to build kernel 2.4.20-rmap15f is;

linux-2.4.18.tar.gz
2.4.20-rmap15f
config_generic
1 patch-2.4.19.gz
1 patch-2.4.20.bz2
1 2.4.20-rmap15f

This first line says to unpack a source tree starting with linux-2.4.18.tar.gz.
The second line specifies that the kernel will be called 2.4.20-rmap15f. 2.4.20
was selected for this example as rmap patches against a later stable release were
not available at the time of writing. To check for updated rmap patches, see
http://surriel.com/patches/. The third line specifies which kernel .config file to
use for compiling the kernel. Each line after that has two parts. The first part says
what patch depth to use, i.e. what number to use with the -p switch to patch. As
discussed earlier in Section 1.2.1, this is usually 1 for applying patches while in the
source directory. The second is the name of the patch stored in the patches directory.
The above example will apply two patches to update the kernel from 2.4.18
to 2.4.20 before building the 2.4.20-rmap15f kernel tree.

If the kernel configuration file required is very simple, then use the createset
script to generate a set file for you. It simply takes a kernel version as a parameter
and guesses how to build it based on available sources and patches.
mel@joshua: patchset/ $ createset 2.4.20
Building a Kernel   The package comes with three scripts. The first script, called make-kernel.sh, will unpack the kernel to the kernels/ directory and build it if requested. If the target distribution is Debian, it can also create Debian packages for easy installation by specifying the -d switch. The second, called make-gengraph.sh, will unpack the kernel but instead of building an installable kernel, it will generate the files required to use CodeViz, discussed in the next section, for creating call graphs. The last, called make-lxr.sh, will install a kernel for use with LXR.
Generating Diffs   Ultimately, you will need to see the difference between files in two trees or generate a diff of changes you have made yourself. Three small scripts are provided to make this task easier. The first is setclean which sets the source tree to compare from. The second is setworking to set the path of the kernel tree you are comparing against or working on. The third is difftree which will generate diffs against files or directories in the two trees. To generate the diff shown in Figure 1.2.1, the following would have worked:
mel@joshua: patchset/ $ setclean linux-2.4.22-clean
mel@joshua: patchset/ $ setworking linux-2.4.22-mel
mel@joshua: patchset/ $ difftree mm/page_alloc.c
The generated diff is a unified diff with the C function context included and complies with the recommended usage of diff. Two additional scripts are available which are very useful when tracking changes between two trees. They are diffstruct and difffunc. These are for printing out the differences between individual structures and functions. When used first, the -f switch must be used to record what source file the structure or function is declared in, but it is only needed the first time.
1.3 Browsing the Code
When code is small and manageable, it is not particularly difficult to browse through the code as operations are clustered together in the same file and there is not much coupling between modules. The kernel unfortunately does not always exhibit this behaviour. Functions of interest may be spread across multiple files or contained as inline functions in headers. To complicate matters, files of interest may be buried beneath architecture specific directories, making tracking them down time consuming.

One solution for easy code browsing is ctags (http://ctags.sourceforge.net/) which generates tag files from a set of source files. These tags can be used to jump to the C file and line where the identifier is declared with editors such as Vi and Emacs. In the event there are multiple instances of the same tag, such as with multiple functions with the same name, the correct one may be selected from a list. This method works best when one is editing the code as it allows very fast navigation through the code to be confined to one terminal window.
A more friendly browsing method is available with the Linux Cross-Referencing (LXR) tool hosted at http://lxr.linux.no/. This tool provides the ability to represent source code as browsable web pages. Identifiers such as global variables, macros and functions become hyperlinks. When clicked, the location where it is defined is displayed along with every file and line referencing the definition. This makes code navigation very convenient and is almost essential when reading the code for the first time.

The tool is very simple to install and a browsable version of the kernel 2.4.22 source is available on the CD included with this book. All code extracts throughout the book are based on the output of LXR so that the line numbers would be clearly visible in excerpts.
1.3.1 Analysing Code Flow
As separate modules share code across multiple C files, it can be difficult to see what functions are affected by a given code path without tracing through all the code manually. For a large or deep code path, this can be extremely time consuming to answer what should be a simple question.

One simple, but effective tool to use is CodeViz, which is a call graph generator and is included with the CD. It uses a modified compiler for either C or C++ to collect the information necessary to generate the graph. The tool is hosted at http://www.csn.ul.ie/∼mel/projects/codeviz/.

During compilation with the modified compiler, files with a .cdep extension are generated for each C file. This .cdep file contains all function declarations and calls made in the C file. These files are distilled with a program called genfull to generate a full call graph of the entire source code which can be rendered with dot, part of the GraphViz project hosted at http://www.graphviz.org/.
In the kernel compiled for the computer this book was written on, there were a total of 40,165 entries in the full.graph file generated by genfull. This call graph is essentially useless on its own because of its size, so a second tool is provided called gengraph. This program, at basic usage, takes the name of one or more functions as an argument and generates a postscript file with the call graph of the requested function as the root node. The postscript file may be viewed with ghostview or gv.

The generated graphs can go to an unnecessary depth or show functions that the user is not interested in, therefore there are three limiting options to graph generation. The first is limit by depth where functions that are greater than N levels deep in a call chain are ignored. The second is to totally ignore a function so that neither it nor any of the functions it calls will appear on the call graph. The last is to display a function, but not traverse it, which is convenient when the function is covered on a separate call graph or is a known API whose implementation is not currently of interest.

All call graphs shown in these documents are generated with the CodeViz tool as it is often much easier to understand a subsystem at first glance when a call graph is available. It has been tested with a number of other open source projects based on C and has wider application than just the kernel.
1.3.2 Simple Graph Generation
If both PatchSet and CodeViz are installed, the first call graph in this book, shown in Figure 3.4, can be generated and viewed with the following set of commands. For brevity, the output of the commands is omitted:

mel@joshua: patchset $ download 2.4.22
mel@joshua: patchset $ createset 2.4.22
mel@joshua: patchset $ make-gengraph.sh 2.4.22
mel@joshua: patchset $ cd kernels/linux-2.4.22
mel@joshua: linux-2.4.22 $ gengraph -t -s "alloc_bootmem_low_pages \
                zone_sizes_init" -f paging_init
mel@joshua: linux-2.4.22 $ gv paging_init.ps
1.4 Reading the Code
When a new developer or researcher asks how to start reading the code, they are
often recommended to start with the initialisation code and work from there. This
may not be the best approach for everyone as initialisation is quite architecture
dependent and requires detailed hardware knowledge to decipher it.
It also gives
very little information on how a subsystem like the VM works as it is during the late
stages of initialisation that memory is set up in the way the running system sees it.
The best starting point to understanding the VM is this book and the code
commentary.
It describes a VM that is reasonably comprehensive without being
overly complicated. Later VMs are more complex but are essentially extensions of
the one described here.
For when the code has to be approached afresh with a later VM, it is always best to start in an isolated region that has the minimum number of dependencies. In the case of the VM, the best starting point is the Out Of Memory (OOM) manager in mm/oom_kill.c. It is a very gentle introduction to one corner of the VM where a process is selected to be killed in the event that memory in the system is low. It is because it touches so many different aspects of the VM that it is covered last in this book! The second subsystem to then examine is the non-contiguous memory allocator located in mm/vmalloc.c and discussed in Chapter 7 as it is reasonably contained within one file. The third system should be the physical page allocator located in mm/page_alloc.c and discussed in Chapter 6 for similar reasons. The fourth system of interest is the creation of VMAs and memory areas for processes discussed in Chapter 4. Between these systems, they have the bulk of the code patterns that are prevalent throughout the rest of the kernel code, making the deciphering of more complex systems such as the page replacement policy or the buffer IO much easier to comprehend.

The second recommendation that is given by experienced developers is to benchmark and test the VM. There are many benchmark programs available but commonly used ones are ConTest (http://members.optusnet.com.au/ckolivas/contest/), SPEC (http://www.specbench.org/), lmbench (http://www.bitmover.com/lmbench/) and dbench (http://freshmeat.net/projects/dbench/). For many purposes, these benchmarks will fit the requirements.
Unfortunately it is difficult to test just the VM accurately and benchmarking it is frequently based on timing a task such as a kernel compile. A tool called VM Regress is available at http://www.csn.ul.ie/∼mel/vmregress/ that lays the foundation required to build a fully fledged testing, regression and benchmarking tool for the VM. It uses a combination of kernel modules and userspace tools to test small parts of the VM in a reproducible manner and has one benchmark for testing the page replacement policy using a large reference string. It is intended as a framework for the development of a testing utility and has a number of Perl libraries and helper kernel modules to do much of the work, but it is still in the early stages of development so use it with care.
1.5 Submitting Patches
There are two files, SubmittingPatches and CodingStyle, in the Documentation/ directory which cover the important basics. However, there is very little documentation describing how to get patches merged. This section will give a brief introduction on how, broadly speaking, patches are managed.

First and foremost, the coding style of the kernel needs to be adhered to as having a style inconsistent with the main kernel will be a barrier to getting merged regardless of the technical merit. Once a patch has been developed, the first problem is to decide where to send it. Kernel development has a definite, if non-apparent, hierarchy of who handles patches and how to get them submitted. As an example, we'll take the case of 2.5.x development.
The first check to make is if the patch is very small or trivial. If it is, post it to the main kernel mailing list. If there is no bad reaction, it can be fed to what is called the Trivial Patch Monkey 7 . The trivial patch monkey is exactly what it sounds like, it takes small patches and feeds them en-masse to the correct people. This is best suited for documentation, commentary or one-liner patches.
Patches are managed through what could be loosely called a set of rings with Linus in the very middle having the final say on what gets accepted into the main tree. Linus, with rare exceptions, accepts patches only from those he refers to as his lieutenants, a group of around 10 people who he trusts to feed him correct code. An example lieutenant is Andrew Morton, the VM maintainer at the time of writing. Any change to the VM has to be accepted by Andrew before it will get to Linus. These people are generally maintainers of a particular system but sometimes will feed him patches from another subsystem if they feel it is important enough.

Each of the lieutenants are active developers on different subsystems. Just like Linus, they have a small set of developers they trust to be knowledgeable about the patch they are sending but will also pick up patches which affect their subsystem more readily. Depending on the subsystem, the list of people they trust will be heavily influenced by the list of maintainers in the MAINTAINERS file. The second major area of influence will be from the subsystem specific mailing list if there is one. The VM does not have a list of maintainers but it does have a mailing list 8 .
The maintainers and lieutenants are crucial to the acceptance of patches. Linus, broadly speaking, does not appear to wish to be convinced with argument alone on the merit for a significant patch but prefers to hear it from one of his lieutenants, which is understandable considering the volume of patches that exists.

In summary, a new patch should be emailed to the subsystem mailing list and cc'd to the main list to generate discussion. If there is no reaction, it should be sent to the maintainer for that area of code if there is one and to the lieutenant if there is not. Once it has been picked up by a maintainer or lieutenant, chances are it will be merged. The important key is that patches and ideas must be released early and often so developers have a chance to look at them while they are still manageable. There are notable cases where massive patches failed to merge with the main tree because there were long periods of silence with little or no discussion. A recent example of this is the Linux Kernel Crash Dump project which still has not been merged into the main stream because there has not been enough favorable feedback from lieutenants or strong support from vendors.
7 http://www.kernel.org/pub/linux/kernel/people/rusty/trivial/
8 http://www.linux-mm.org/mailinglists.shtml
Chapter 2
Describing Physical Memory
Linux is available for a wide range of architectures so there needs to be an
architecture-independent way of describing memory.
This chapter describes the
structures used to keep account of memory banks, pages and the flags that affect VM behaviour.
The first principal concept prevalent in the VM is Non-Uniform Memory Access (NUMA). With large scale machines, memory may be arranged into banks that incur a different cost to access depending on the distance from the processor. For example, there might be a bank of memory assigned to each CPU or a bank of memory very suitable for DMA near device cards.

Each bank is called a node and the concept is represented under Linux by a struct pglist_data even if the architecture is UMA. This struct is always referenced by its typedef pg_data_t. Every node in the system is kept on a NULL terminated list called pgdat_list and each node is linked to the next with the field pg_data_t→node_next. For UMA architectures like PC desktops, only one static pg_data_t structure called contig_page_data is used. Nodes will be discussed further in Section 2.1.
Each node is divided up into a number of blocks called zones which represent ranges within memory. Zones should not be confused with zone based allocators as they are unrelated. A zone is described by a struct zone_struct, typedeffed to zone_t, and each one is of type ZONE_DMA, ZONE_NORMAL or ZONE_HIGHMEM. Each zone type is suitable for a different type of usage. ZONE_DMA is memory in the lower physical memory ranges which certain ISA devices require. Memory within ZONE_NORMAL is directly mapped by the kernel into the upper region of the linear address space which is discussed further in Section 4.1. ZONE_HIGHMEM is the remaining available memory in the system and is not directly mapped by the kernel.

With the x86 the zones are:

ZONE_DMA        First 16MiB of memory
ZONE_NORMAL     16MiB - 896MiB
ZONE_HIGHMEM    896MiB - End

It is important to note that many kernel operations can only take place using ZONE_NORMAL so it is the most performance critical zone. Zones are discussed further in Section 2.2. Each physical page frame is represented by a struct page and all the structs are kept in a global mem_map array which is usually stored at the beginning of ZONE_NORMAL or just after the area reserved for the loaded kernel image in low memory machines. struct pages are discussed in detail in Section 2.4 and the global mem_map array is discussed in detail in Section 3.7. The basic relationship between all these structs is illustrated in Figure 2.1.
Figure 2.1: Relationship Between Nodes, Zones and Pages
As the amount of memory directly accessible by the kernel (ZONE_NORMAL) is limited in size, Linux supports the concept of High Memory which is discussed further in Section 2.5. This chapter will discuss how nodes, zones and pages are represented before introducing high memory management.
2.1 Nodes
As we have mentioned, each node in memory is described by a pg_data_t which is a typedef for a struct pglist_data. When allocating a page, Linux uses a node-local allocation policy to allocate memory from the node closest to the running CPU. As processes tend to run on the same CPU, it is likely the memory from the current node will be used. The struct is declared as follows in <linux/mmzone.h>:
129 typedef struct pglist_data {
130     zone_t node_zones[MAX_NR_ZONES];
131     zonelist_t node_zonelists[GFP_ZONEMASK+1];
132     int nr_zones;
133     struct page *node_mem_map;
134     unsigned long *valid_addr_bitmap;
135     struct bootmem_data *bdata;
136     unsigned long node_start_paddr;
137     unsigned long node_start_mapnr;
138     unsigned long node_size;
139     int node_id;
140     struct pglist_data *node_next;
141 } pg_data_t;
We now briefly describe each of these fields:

node_zones   The zones for this node, ZONE_HIGHMEM, ZONE_NORMAL, ZONE_DMA;

node_zonelists   This is the order of zones that allocations are preferred from. build_zonelists() in mm/page_alloc.c sets up the order when called by free_area_init_core(). A failed allocation in ZONE_HIGHMEM may fall back to ZONE_NORMAL or back to ZONE_DMA;

nr_zones   Number of zones in this node, between 1 and 3. Not all nodes will have three. A CPU bank may not have ZONE_DMA for example;

node_mem_map   This is the first page of the struct page array representing each physical frame in the node. It will be placed somewhere within the global mem_map array;

valid_addr_bitmap   A bitmap which describes holes in the memory node that no memory exists for. In reality, this is only used by the Sparc and Sparc64 architectures and ignored by all others;

bdata   This is only of interest to the boot memory allocator discussed in Chapter 5;

node_start_paddr   The starting physical address of the node. An unsigned long does not work optimally as it breaks for ia32 with Physical Address Extension (PAE) for example. PAE is discussed further in Section 2.5. A more suitable solution would be to record this as a Page Frame Number (PFN). A PFN is simply an index within physical memory that is counted in page-sized units. The PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT);

node_start_mapnr   This gives the page offset within the global mem_map. It is calculated in free_area_init_core() by calculating the number of pages between mem_map and the local mem_map for this node called lmem_map;

node_size   The total number of pages in this node;

node_id   The Node ID (NID) of the node, starts at 0;

node_next   Pointer to next node in a NULL terminated list.
All nodes in the system are maintained on a list called pgdat_list. The nodes are placed on this list as they are initialised by the init_bootmem_core() function, described later in Section 5.2.1. Up until late 2.4 kernels (> 2.4.18), blocks of code that traversed the list looked something like:

pg_data_t * pgdat;
pgdat = pgdat_list;
do {
        /* do something with pgdata_t */
        ...
} while ((pgdat = pgdat->node_next));
In more recent kernels, a macro for_each_pgdat(), which is trivially defined as a for loop, is provided to improve code readability.
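As a rough illustration (a sketch rather than a verbatim kernel excerpt), the same traversal written with the macro looks like:

pg_data_t *pgdat;

for_each_pgdat(pgdat) {
        /* do something with the pg_data_t, just as in the do-while loop above */
}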
2.2 Zones
Zones are described by a struct zone_struct and are usually referred to by their typedef zone_t. A zone keeps track of information like page usage statistics, free area information and locks. It is declared as follows in <linux/mmzone.h>:
37 typedef struct zone_struct {
41     spinlock_t          lock;
42     unsigned long       free_pages;
43     unsigned long       pages_min, pages_low, pages_high;
44     int                 need_balance;
45
49     free_area_t         free_area[MAX_ORDER];
50
76     wait_queue_head_t * wait_table;
77     unsigned long       wait_table_size;
78     unsigned long       wait_table_shift;
79
83     struct pglist_data *zone_pgdat;
84     struct page        *zone_mem_map;
85     unsigned long       zone_start_paddr;
86     unsigned long       zone_start_mapnr;
87
91     char               *name;
92     unsigned long       size;
93 } zone_t;
This is a brief explanation of each field in the struct.

lock   Spinlock to protect the zone from concurrent accesses;

free_pages   Total number of free pages in the zone;

pages_min, pages_low, pages_high   These are zone watermarks which are described in the next section;

need_balance   This flag tells the pageout daemon kswapd to balance the zone. A zone is said to need balance when the number of available pages reaches one of the zone watermarks. Watermarks are discussed in the next section;

free_area   Free area bitmaps used by the buddy allocator;

wait_table   A hash table of wait queues of processes waiting on a page to be freed. This is of importance to wait_on_page() and unlock_page(). While processes could all wait on one queue, this would cause all waiting processes to race for pages still locked when woken up. A large group of processes contending for a shared resource like this is sometimes called a thundering herd. Wait tables are discussed further in Section 2.2.3;

wait_table_size   Number of queues in the hash table which is a power of 2;

wait_table_shift   Defined as the number of bits in a long minus the binary logarithm of the table size above;

zone_pgdat   Points to the parent pg_data_t;

zone_mem_map   The first page in the global mem_map this zone refers to;

zone_start_paddr   Same principle as node_start_paddr;

zone_start_mapnr   Same principle as node_start_mapnr;

name   The string name of the zone, DMA, Normal or HighMem;

size   The size of the zone in pages.
2.2.1 Zone Watermarks
When available memory in the system is low, the pageout daemon kswapd is woken up to start freeing pages (see Chapter 10). If the pressure is high, the process will free up memory synchronously, sometimes referred to as the direct-reclaim path. The parameters affecting pageout behaviour are similar to those used by FreeBSD [McK96] and Solaris [MM01].

Each zone has three watermarks called pages_low, pages_min and pages_high which help track how much pressure a zone is under. The relationship between them is illustrated in Figure 2.2. The number of pages for pages_min is calculated in the function free_area_init_core() during memory init and is based on a ratio to the size of the zone in pages. It is calculated initially as ZoneSizeInPages/128. The lowest value it will be is 20 pages (80K on a x86) and the highest possible value is 255 pages (1MiB on a x86).
Figure 2.2: Zone Watermarks
pages_low   When the pages_low number of free pages is reached, kswapd is woken up by the buddy allocator to start freeing pages. This is equivalent to when lotsfree is reached in Solaris and freemin in FreeBSD. The value is twice the value of pages_min by default;

pages_min   When pages_min is reached, the allocator will do the kswapd work in a synchronous fashion, sometimes referred to as the direct-reclaim path. There is no real equivalent in Solaris but the closest is the desfree or minfree which determine how often the pageout scanner is woken up;

pages_high   Once kswapd has been woken to start freeing pages, it will not consider the zone to be balanced until pages_high pages are free. Once the watermark has been reached, kswapd will go back to sleep. In Solaris, this is called lotsfree and in BSD, it is called free_target. The default for pages_high is three times the value of pages_min.

Whatever the pageout parameters are called in each operating system, the meaning is the same, it helps determine how hard the pageout daemon or processes work to free up pages.
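To make the calculation above concrete, the following is a simplified sketch in C of the watermark setup described in this section; the variable names are illustrative rather than the kernel's own:

/* Sketch of the watermark calculation described above, not the exact
 * kernel code. zone_size_in_pages is an illustrative name for the
 * number of pages in the zone and zone is a zone_t pointer. */
unsigned long mask = zone_size_in_pages / 128;

if (mask < 20)          /* lowest value is 20 pages (80K on the x86) */
        mask = 20;
if (mask > 255)         /* highest value is 255 pages (1MiB on the x86) */
        mask = 255;

zone->pages_min  = mask;
zone->pages_low  = mask * 2;    /* twice pages_min by default */
zone->pages_high = mask * 3;    /* three times pages_min by default */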
2.2.2 Calculating The Size of Zones
Figure 2.3: Call Graph: setup_memory()
The PFN is an offset, counted in pages, within the physical memory map. The first PFN usable by the system, min_low_pfn, is located at the beginning of the first page after _end which is the end of the loaded kernel image. The value is stored as a file scope variable in mm/bootmem.c for use with the boot memory allocator.

How the last page frame in the system, max_pfn, is calculated is quite architecture specific. In the x86 case, the function find_max_pfn() reads through the whole e820 map for the highest page frame. The value is also stored as a file scope variable in mm/bootmem.c. The e820 is a table provided by the BIOS describing what physical memory is available, reserved or non-existent.

The value of max_low_pfn is calculated on the x86 with find_max_low_pfn() and it marks the end of ZONE_NORMAL. This is the physical memory directly accessible by the kernel and is related to the kernel/userspace split in the linear address space marked by PAGE_OFFSET. The value, with the others, is stored in mm/bootmem.c. Note that in low memory machines, the max_pfn will be the same as the max_low_pfn.
With the three variables min_low_pfn, max_low_pfn and max_pfn, it is straightforward to calculate the start and end of high memory and place them as file scope variables in arch/i386/mm/init.c as highstart_pfn and highend_pfn. The values are used later to initialise the high memory pages for the physical page allocator as we will see much later in Section 5.5.
2.2.3 Zone Wait Queue Table
When IO is being performed on a page, such as during page-in or page-out, it is locked to prevent accessing it with inconsistent data. Processes wishing to use it have to join a wait queue before it can be accessed by calling wait_on_page(). When the IO is completed, the page will be unlocked with UnlockPage() and any process waiting on the queue will be woken up. Each page could have a wait queue but it would be very expensive in terms of memory to have so many separate queues, so instead the wait queue is stored in the zone_t.

It is possible to have just one wait queue in the zone but that would mean that all processes waiting on any page in a zone would be woken up when one was unlocked. This would cause a serious thundering herd problem. Instead, a hash table of wait queues is stored in zone_t→wait_table. In the event of a hash collision, processes may still be woken unnecessarily but collisions are not expected to occur frequently.
Figure 2.4: Sleeping On a Locked Page
The table is allocated during free_area_init_core(). The size of the table is calculated by wait_table_size() and stored in zone_t→wait_table_size. The maximum size it will be is 4096 wait queues. For smaller tables, the size of the table is the minimum power of 2 required to store NoPages / PAGES_PER_WAITQUEUE number of queues, where NoPages is the number of pages in the zone and PAGE_PER_WAITQUEUE is defined to be 256. In other words, the size of the table is calculated as the integer component of the following equation:

    wait_table_size = log2((NoPages * 2 / PAGE_PER_WAITQUEUE) - 1)

The field zone_t→wait_table_shift is calculated as the number of bits a page address must be shifted right to return an index within the table.
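As a rough sketch of the sizing rule just described (not necessarily the exact body of wait_table_size()), the table size could be computed as:

/* Sketch of the sizing rule: the smallest power of 2 that can hold
 * NoPages / PAGES_PER_WAITQUEUE queues, capped at 4096 entries. */
unsigned long queues = zone_pages / 256;        /* PAGES_PER_WAITQUEUE */
unsigned long size = 1;

while (size < queues)
        size <<= 1;             /* round up to the next power of 2 */
if (size > 4096)
        size = 4096;            /* never more than 4096 wait queues */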
The function page_waitqueue() is responsible for returning which wait queue to use for a page in a zone. It uses a simple multiplicative hashing algorithm based on the virtual address of the struct page being hashed. It works by simply multiplying the address by GOLDEN_RATIO_PRIME and shifting the result zone_t→wait_table_shift bits right to index the result within the hash table. GOLDEN_RATIO_PRIME [Lev00] is the largest prime that is closest to the golden ratio [Knu68] of the largest integer that may be represented by the architecture.
2.3 Zone Initialisation
The zones are initialised after the kernel page tables have been fully set up by paging_init(). Page table initialisation is covered in Section 3.6. Predictably, each architecture performs this task differently but the objective is always the same, to determine what parameters to send to either free_area_init() for UMA architectures or free_area_init_node() for NUMA. The only parameter required for UMA is zones_size. The full list of parameters:

nid   is the Node ID which is the logical identifier of the node whose zones are being initialised;

pgdat   is the node's pg_data_t that is being initialised. In UMA, this will simply be contig_page_data;

pmap   is set later by free_area_init_core() to point to the beginning of the local lmem_map array allocated for the node. In NUMA, this is ignored as NUMA treats mem_map as a virtual array starting at PAGE_OFFSET. In UMA, this pointer is the global mem_map variable, which is how mem_map gets initialised in UMA;

zones_sizes   is an array containing the size of each zone in pages;

zone_start_paddr   is the starting physical address for the first zone;

zone_holes   is an array containing the total size of memory holes in the zones.

It is the core function free_area_init_core() which is responsible for filling in each zone_t with the relevant information and the allocation of the mem_map array for the node. Note that information on what pages are free for the zones is not determined at this point. That information is not known until the boot memory allocator is being retired, which will be discussed much later in Chapter 5.
2.3.1 Initialising mem_map
The mem_map area is created during system startup in one of two fashions. On NUMA systems, the global mem_map is treated as a virtual array starting at PAGE_OFFSET. free_area_init_node() is called for each active node in the system which allocates the portion of this array for the node being initialised. On UMA systems, free_area_init() uses contig_page_data as the node and the global mem_map as the local mem_map for this node. The callgraph for both functions is shown in Figure 2.5.
Figure 2.5: Call Graph: free_area_init()

The core function free_area_init_core() allocates a local lmem_map for the node being initialised. The memory for the array is allocated from the boot memory allocator with alloc_bootmem_node() (see Chapter 5). With UMA architectures, this newly allocated memory becomes the global mem_map but it is slightly different for NUMA.
NUMA architectures allocate the memory for lmem_map within their own memory node. The global mem_map never gets explicitly allocated but instead is set to PAGE_OFFSET where it is treated as a virtual array. The address of the local map is stored in pg_data_t→node_mem_map which exists somewhere within the virtual mem_map. For each zone that exists in the node, the address within the virtual mem_map for the zone is stored in zone_t→zone_mem_map. All the rest of the code then treats mem_map as a real array as only valid regions within it will be used by nodes.
2.4 Pages
Every physical page frame in the system has an associated struct page which is used to keep track of its status. In the 2.2 kernel [BC00], this structure resembled its equivalent in System V [GC94] but like the other UNIX variants, the structure changed considerably. It is declared as follows in <linux/mm.h>:
152 typedef struct page {
153     struct list_head list;
154     struct address_space *mapping;
155     unsigned long index;
156     struct page *next_hash;
158     atomic_t count;
159     unsigned long flags;
161     struct list_head lru;
163     struct page **pprev_hash;
164     struct buffer_head * buffers;
175 #if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
176     void *virtual;
177 #endif /* CONFIG_HIGMEM || WANT_PAGE_VIRTUAL */
179 } mem_map_t;
Here is a brief description of each of the fields:

list   Pages may belong to many lists and this field is used as the list head. For example, pages in a mapping will be in one of three circular linked lists kept by the address_space. These are clean_pages, dirty_pages and locked_pages. In the slab allocator, this field is used to store pointers to the slab and cache the page belongs to. It is also used to link blocks of free pages together;

mapping   When files or devices are memory mapped, their inode has an associated address_space. This field will point to this address space if the page belongs to the file. If the page is anonymous and mapping is set, the address_space is swapper_space which manages the swap address space;

index   This field has two uses and what it means depends on the state of the page. If the page is part of a file mapping, it is the offset within the file. If the page is part of the swap cache, this will be the offset within the address_space for the swap address space (swapper_space). Secondly, if a block of pages is being freed for a particular process, the order (power of two number of pages being freed) of the block being freed is stored in index. This is set in the function __free_pages_ok();

next_hash   Pages that are part of a file mapping are hashed on the inode and offset. This field links pages together that share the same hash bucket;

count   The reference count to the page. If it drops to 0, it may be freed. Any greater and it is in use by one or more processes or is in use by the kernel like when waiting for IO;
flags   These are flags which describe the status of the page. All of them are declared in <linux/mm.h> and are listed in Table 2.1. There are a number of macros defined for testing, clearing and setting the bits which are all listed in Table 2.2. The only really interesting one is SetPageUptodate() which calls an architecture specific function arch_set_page_uptodate() if it is defined before setting the bit;

lru   For the page replacement policy, pages that may be swapped out will exist on either the active_list or the inactive_list declared in page_alloc.c. This is the list head for these LRU lists. These two lists are discussed in detail in Chapter 10;

pprev_hash   This is the complement to next_hash so that the hash can work as a doubly linked list;

buffers   If a page has buffers for a block device associated with it, this field is used to keep track of the buffer_head. An anonymous page mapped by a process may also have an associated buffer_head if it is backed by a swap file. This is necessary as the page has to be synced with backing storage in block sized chunks defined by the underlying filesystem;

virtual   Normally only pages from ZONE_NORMAL are directly mapped by the kernel. To address pages in ZONE_HIGHMEM, kmap() is used to map the page for the kernel which is described further in Chapter 9. There are only a fixed number of pages that may be mapped. When it is mapped, this is its virtual address;

The type mem_map_t is a typedef for struct page so it can be easily referred to within the mem_map array.
2.4.1 Mapping Pages to Zones
Up until as recently as kernel 2.4.18, a struct page stored a reference to its zone with page→zone which was later considered wasteful, as even such a small pointer consumes a lot of memory when thousands of struct pages exist. In more recent kernels, the zone field has been removed and instead the top ZONE_SHIFT (8 in the x86) bits of the page→flags are used to determine the zone a page belongs to. First a zone_table of zones is set up. It is declared in mm/page_alloc.c as:
33 zone_t *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
34 EXPORT_SYMBOL(zone_table);
MAX_NR_ZONES is the maximum number of zones that can be in a node, i.e. 3. MAX_NR_NODES is the maximum number of nodes that may exist. The function EXPORT_SYMBOL() makes zone_table accessible to loadable modules. This table is treated like a multi-dimensional array. During free_area_init_core(), all the pages in a node are initialised. First it sets the value for the table

733             zone_table[nid * MAX_NR_ZONES + j] = zone;

where nid is the node ID, j is the zone index and zone is the zone_t struct. For each page, the function set_page_zone() is called as

788             set_page_zone(page, nid * MAX_NR_ZONES + j);

The parameter, page, is the page whose zone is being set. So, clearly the index in the zone_table is stored in the page.
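To illustrate the reverse lookup, the following sketch shows how the zone can be recovered from those flag bits; it is a simplification rather than a verbatim excerpt:

/* Sketch: recover the zone of a page from the index stored in the top
 * bits of page->flags. Shifting by ZONE_SHIFT leaves only the
 * zone_table index. */
static inline zone_t *page_zone_sketch(struct page *page)
{
        return zone_table[page->flags >> ZONE_SHIFT];
}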
2.5 High Memory
As the address space usable by the kernel (ZONE_NORMAL) is limited in size, the kernel has support for the concept of High Memory. Two thresholds of high memory exist on 32-bit x86 systems, one at 4GiB and a second at 64GiB. The 4GiB limit is related to the amount of memory that may be addressed by a 32-bit physical address. To access memory between the range of 1GiB and 4GiB, the kernel temporarily maps pages from high memory into ZONE_NORMAL with kmap(). This is discussed further in Chapter 9.

The second limit at 64GiB is related to Physical Address Extension (PAE) which is an Intel invention to allow more RAM to be used with 32 bit systems. It makes 4 extra bits available for the addressing of memory, allowing up to 2^36 bytes (64GiB) of memory to be addressed.

PAE allows a processor to address up to 64GiB in theory but, in practice, processes in Linux still cannot access that much RAM as the virtual address space is still only 4GiB. This has led to some disappointment from users who have tried to malloc() all their RAM with one process.
Secondly, PAE does not allow the kernel itself to have this much RAM available. The struct page used to describe each page frame still requires 44 bytes and this uses kernel virtual address space in ZONE_NORMAL. That means that to describe 1GiB of memory, approximately 11MiB of kernel memory is required. Thus, with 16GiB, 176MiB of memory is consumed, putting significant pressure on ZONE_NORMAL. This does not sound too bad until other structures are taken into account which use ZONE_NORMAL. Even very small structures such as Page Table Entries (PTEs) require about 16MiB in the worst case. This makes 16GiB about the practical limit for available physical memory for Linux on an x86. If more memory needs to be accessed, the advice given is simple and straightforward, buy a 64 bit machine.
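For reference, the arithmetic behind the figures above is straightforward: 1GiB contains 1GiB/4KiB = 262,144 page frames and 262,144 × 44 bytes is roughly 11MiB of struct pages, so 16GiB requires approximately 16 × 11MiB = 176MiB.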
2.6 What's New In 2.6
Nodes   At first glance, there have not been many changes made to how memory is described but the seemingly minor changes are wide reaching. The node descriptor pg_data_t has a few new fields which are as follows:

node_start_pfn   replaces the node_start_paddr field. The only difference is that the new field is a PFN instead of a physical address. This was changed as PAE architectures can address more memory than 32 bits can address, so nodes starting over 4GiB would be unreachable with the old field;

kswapd_wait   is a new wait queue for kswapd. In 2.4, there was a global wait queue for the page swapper daemon. In 2.6, there is one kswapdN for each node where N is the node identifier and each kswapd has its own wait queue with this field.

The node_size field has been removed and replaced instead with two fields. The change was introduced to recognise the fact that nodes may have holes in them where there is no physical memory backing the address.

node_present_pages   is the total number of physical pages that are present in the node.

node_spanned_pages   is the total area that is addressed by the node, including any holes that may exist.
Zones   Even at first glance, zones look very different. They are no longer called zone_t but instead referred to as simply struct zone. The second major difference is the LRU lists. As we'll see in Chapter 10, kernel 2.4 has a global list of pages that determine the order pages are freed or paged out. These lists are now stored in the struct zone. The relevant fields are:

lru_lock   is the spinlock for the LRU lists in this zone. In 2.4, this is a global lock called pagemap_lru_lock;

active_list   is the active list for this zone. This list is the same as described in Chapter 10 except it is now per-zone instead of global;

inactive_list   is the inactive list for this zone. In 2.4, it is global;

refill_counter   is the number of pages to remove from the active_list in one pass. Only of interest during page replacement;

nr_active   is the number of pages on the active_list;

nr_inactive   is the number of pages on the inactive_list;

all_unreclaimable   is set to 1 if the pageout daemon scans through all the pages in the zone twice and still fails to free enough pages;

pages_scanned   is the number of pages scanned since the last bulk amount of pages has been reclaimed. In 2.6, lists of pages are freed at once rather than freeing pages individually which is what 2.4 does;

pressure   measures the scanning intensity for this zone. It is a decaying average which affects how hard a page scanner will work to reclaim pages.
Three other fields are new but they are related to the dimensions of the zone. They are:

zone_start_pfn   is the starting PFN of the zone. It replaces the zone_start_paddr and zone_start_mapnr fields in 2.4;

spanned_pages   is the number of pages this zone spans, including holes in memory which exist with some architectures;

present_pages   is the number of real pages that exist in the zone. For many architectures, this will be the same value as spanned_pages.

The next addition is struct per_cpu_pageset which is used to maintain lists of pages for each CPU to reduce spinlock contention. The zone→pageset field is a NR_CPU sized array of struct per_cpu_pageset where NR_CPU is the compiled upper limit on the number of CPUs in the system. The per-cpu struct is discussed further at the end of the section.

The last addition to struct zone is the inclusion of padding of zeros in the struct. Development of the 2.6 VM recognised that some spinlocks are very heavily contended and are frequently acquired. As it is known that some locks are almost always acquired in pairs, an effort should be made to ensure they use different cache lines, which is a common cache programming trick [Sea00]. These paddings in the struct zone are marked with the ZONE_PADDING() macro and are used to ensure the zone→lock, zone→lru_lock and zone→pageset fields use different cache lines.
Pages   The first noticeable change is that the ordering of fields has been changed so that related items are likely to be in the same cache line. The fields are essentially the same except for two additions. The first is a new union used to create a PTE chain. PTE chains are related to page table management so will be discussed at the end of Chapter 3. The second addition is the page→private field which contains private information specific to the mapping. For example, the field is used to store a pointer to a buffer_head if the page is a buffer page. This means that the page→buffers field has also been removed. The last important change is that page→virtual is no longer necessary for high memory support and will only exist if the architecture specifically requests it. How high memory pages are supported is discussed further in Chapter 9.

Per-CPU Page Lists   In 2.4, only one subsystem actively tries to maintain per-cpu lists for any object and that is the Slab Allocator, discussed in Chapter 8. In 2.6, the concept is much more wide-spread and there is a formalised concept of hot and cold pages.
The struct per_cpu_pageset, declared in <linux/mmzone.h>, has one field which is an array with two elements of type per_cpu_pages. The zeroth element of this array is for hot pages and the first element is for cold pages, where hot and cold determines how active the page is currently in the cache. When it is known for a fact that the pages are not to be referenced soon, such as with IO readahead, they will be allocated as cold pages.

The struct per_cpu_pages maintains a count of the number of pages currently in the list, a high and low watermark which determine when the set should be refilled or pages freed in bulk, a variable which determines how many pages should be allocated in one block and finally, the actual list head of pages.
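Based on that description, the structures look roughly like the following sketch; the field names are indicative and the authoritative definition is in <linux/mmzone.h>:

/* Sketch of the 2.6 per-cpu page list structures described above.
 * Field names are indicative rather than guaranteed to match exactly. */
struct per_cpu_pages {
        int count;              /* number of pages in the list */
        int low;                /* low watermark, refill needed */
        int high;               /* high watermark, empty in bulk */
        int batch;              /* number of pages moved in one block */
        struct list_head list;  /* the list of pages */
};

struct per_cpu_pageset {
        struct per_cpu_pages pcp[2];    /* element 0: hot pages, 1: cold pages */
};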
To build upon the per-cpu page lists, there is also a per-cpu page accounting mechanism. There is a struct page_state that holds a number of accounting variables such as the pgalloc field which tracks the number of pages allocated to this CPU and pswpin which tracks the number of swap readins. The struct is heavily commented in <linux/page-flags.h>. A single function mod_page_state() is provided for updating fields in the page_state for the running CPU and three helper macros are provided called inc_page_state(), dec_page_state() and sub_page_state().
Bit name         Description

PG_active        This bit is set if a page is on the active_list LRU and cleared when it is removed. It marks a page as being hot

PG_arch_1        Quoting directly from the code: PG_arch_1 is an architecture specific page state bit. The generic code guarantees that this bit is cleared for a page when it first is entered into the page cache. This allows an architecture to defer the flushing of the D-Cache (See Section 3.9) until the page is mapped by a process

PG_checked       Only used by the Ext2 filesystem

PG_dirty         This indicates if a page needs to be flushed to disk. When a page is written to that is backed by disk, it is not flushed immediately, this bit is needed to ensure a dirty page is not freed before it is written out

PG_error         If an error occurs during disk I/O, this bit is set

PG_fs_1          Bit reserved for a filesystem to use for its own purposes. Currently, only NFS uses it to indicate if a page is in sync with the remote server or not

PG_highmem       Pages in high memory cannot be mapped permanently by the kernel. Pages that are in high memory are flagged with this bit during mem_init()

PG_launder       This bit is important only to the page replacement policy. When the VM wants to swap out a page, it will set this bit and call the writepage() function. When scanning, if it encounters a page with this bit and PG_locked set, it will wait for the I/O to complete

PG_locked        This bit is set when the page must be locked in memory for disk I/O. When I/O starts, this bit is set and released when it completes

PG_lru           If a page is on either the active_list or the inactive_list, this bit will be set

PG_referenced    If a page is mapped and it is referenced through the mapping, index hash table, this bit is set. It is used during page replacement for moving the page around the LRU lists

PG_reserved      This is set for pages that can never be swapped out. It is set by the boot memory allocator (See Chapter 5) for pages allocated during system startup. Later it is used to flag empty pages or ones that do not even exist

PG_slab          This will flag a page as being used by the slab allocator

PG_skip          Used by some architectures to skip over parts of the address space with no backing physical memory

PG_unused        This bit is literally unused

PG_uptodate      When a page is read from disk without error, this bit will be set.

Table 2.1: Flags Describing Page Status
Bit name         Set                    Test                  Clear

PG_active        SetPageActive()        PageActive()          ClearPageActive()
PG_arch_1        n/a                    n/a                   n/a
PG_checked       SetPageChecked()       PageChecked()         n/a
PG_dirty         SetPageDirty()         PageDirty()           ClearPageDirty()
PG_error         SetPageError()         PageError()           ClearPageError()
PG_highmem       n/a                    PageHighMem()         n/a
PG_launder       SetPageLaunder()       PageLaunder()         ClearPageLaunder()
PG_locked        LockPage()             PageLocked()          UnlockPage()
PG_lru           TestSetPageLRU()       PageLRU()             TestClearPageLRU()
PG_referenced    SetPageReferenced()    PageReferenced()      ClearPageReferenced()
PG_reserved      SetPageReserved()      PageReserved()        ClearPageReserved()
PG_skip          n/a                    n/a                   n/a
PG_slab          PageSetSlab()          PageSlab()            PageClearSlab()
PG_unused        n/a                    n/a                   n/a
PG_uptodate      SetPageUptodate()      PageUptodate()        ClearPageUptodate()

Table 2.2: Macros For Testing, Setting and Clearing page→flags Status Bits
Chapter 3
Page Table Management
Linux layers the machine independent/dependent layer in an unusual manner in comparison to other operating systems [CP99]. Other operating systems have objects which manage the underlying physical pages, such as the pmap object in BSD. Linux instead maintains the concept of a three-level page table in the architecture independent code even if the underlying architecture does not support it. While this is conceptually easy to understand, it also means that the distinction between different types of pages is very blurry and page types are identified by their flags or what lists they exist on rather than the objects they belong to.

Architectures that manage their Memory Management Unit (MMU) differently are expected to emulate the three-level page tables. For example, on the x86 without PAE enabled, only two page table levels are available. The Page Middle Directory (PMD) is defined to be of size 1 and folds back directly onto the Page Global Directory (PGD) which is optimised out at compile time. Unfortunately, for architectures that do not manage their cache or Translation Lookaside Buffer (TLB) automatically, hooks for machine dependent work have to be explicitly left in the code for when the TLB and CPU caches need to be altered and flushed, even if they are null operations on some architectures like the x86. These hooks are discussed further in Section 3.8.

This chapter will begin by describing how the page table is arranged and what types are used to describe the three separate levels of the page table, followed by how a virtual address is broken up into its component parts for navigating the table. Once covered, it will be discussed how the lowest level entry, the Page Table Entry (PTE), is used and what bits are used by the hardware. After that, the macros used for navigating a page table, setting and checking attributes will be discussed before talking about how the page table is populated and how pages are allocated and freed for use with page tables. The initialisation stage is then discussed, which shows how the page tables are initialised during boot strapping. Finally, we will cover how the TLB and CPU caches are utilised.
3.1 Describing the Page Directory
Each process has a pointer (mm_struct→pgd) to its own Page Global Directory (PGD) which is a physical page frame. This frame contains an array of type pgd_t which is an architecture specific type defined in <asm/page.h>. The page tables are loaded differently depending on the architecture. On the x86, the process page table is loaded by copying mm_struct→pgd into the cr3 register which has the side effect of flushing the TLB. In fact this is how the function __flush_tlb() is implemented in the architecture dependent code.

Each active entry in the PGD table points to a page frame containing an array of Page Middle Directory (PMD) entries of type pmd_t which in turn points to page frames containing Page Table Entries (PTE) of type pte_t, which finally points to page frames containing the actual user data. In the event the page has been swapped out to backing storage, the swap entry is stored in the PTE and used by do_swap_page() during page fault to find the swap entry containing the page data. The page table layout is illustrated in Figure 3.1.
Figure 3.1: Page Table Layout
Any given linear address may be broken up into parts to yield offsets within these three page table levels and an offset within the actual page. To help break up the linear address into its component parts, a number of macros are provided in triplets for each page table level, namely a SHIFT, a SIZE and a MASK macro. The SHIFT macros specify the length in bits that are mapped by each level of the page tables as illustrated in Figure 3.2.

Figure 3.2: Linear Address Bit Size Macros

The MASK values can be ANDed with a linear address to mask out all the upper bits and are frequently used to determine if a linear address is aligned to a given level within the page table. The SIZE macros reveal how many bytes are addressed by each entry at each level. The relationship between the SIZE and MASK macros is illustrated in Figure 3.3.
Figure 3.3: Linear Address Size and Mask Macros
For the calculation of each of the triplets, only SHIFT is important as the other two are calculated based on it. For example, the three macros for the page level on the x86 are:

5 #define PAGE_SHIFT      12
6 #define PAGE_SIZE       (1UL << PAGE_SHIFT)
7 #define PAGE_MASK       (~(PAGE_SIZE-1))

PAGE_SHIFT is the length in bits of the offset part of the linear address space, which is 12 bits on the x86. The size of a page is easily calculated as 2^PAGE_SHIFT which is the equivalent of the code above. Finally the mask is calculated as the negation of the bits which make up PAGE_SIZE - 1. If a page needs to be aligned on a page boundary, PAGE_ALIGN() is used. This macro adds PAGE_SIZE - 1 to the address before simply ANDing it with the PAGE_MASK to zero out the page offset bits.
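That alignment idiom amounts to the following sketch of the macro; the version in <asm/page.h> may differ in casting details:

/* Sketch: round an address up to the next page boundary. */
#define PAGE_ALIGN(addr)        (((addr) + PAGE_SIZE - 1) & PAGE_MASK)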
PMD_SHIFT is the number of bits in the linear address which are mapped by the second level part of the table. The PMD_SIZE and PMD_MASK are calculated in a similar way to the page level macros.

PGDIR_SHIFT is the number of bits which are mapped by the top, or first level, of the page table. The PGDIR_SIZE and PGDIR_MASK are calculated in the same manner as above.
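For illustration, the derived macros mirror the page-level pattern shown earlier; the following is a sketch consistent with that description rather than a quoted excerpt:

/* Sketch: SIZE and MASK derived from SHIFT, mirroring the PAGE_* macros. */
#define PMD_SIZE        (1UL << PMD_SHIFT)
#define PMD_MASK        (~(PMD_SIZE-1))
#define PGDIR_SIZE      (1UL << PGDIR_SHIFT)
#define PGDIR_MASK      (~(PGDIR_SIZE-1))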
The last three macros of importance are the PTRS_PER_x which determine the number of entries in each level of the page table. PTRS_PER_PGD is the number of pointers in the PGD, 1024 on an x86 without PAE. PTRS_PER_PMD is for the PMD, 1 on the x86 without PAE and PTRS_PER_PTE is for the lowest level, 1024 on the x86.
3.2 Describing a Page Table Entry
As mentioned, each entry is described by the structs pte_t, pmd_t and pgd_t for PTEs, PMDs and PGDs respectively. Even though these are often just unsigned integers, they are defined as structs for two reasons. The first is for type protection so that they will not be used inappropriately. The second is for features like PAE on the x86 where an additional 4 bits is used for addressing more than 4GiB of memory. To store the protection bits, pgprot_t is defined which holds the relevant flags and is usually stored in the lower bits of a page table entry.

For type casting, 4 macros are provided in asm/page.h which take the above types and return the relevant part of the structs. They are pte_val(), pmd_val(), pgd_val() and pgprot_val(). To reverse the type casting, 4 more macros are provided: __pte(), __pmd(), __pgd() and __pgprot().

Where exactly the protection bits are stored is architecture dependent. For illustration purposes, we will examine the case of an x86 architecture without PAE enabled but the same principles apply across architectures. On an x86 with no PAE, the pte_t is simply a 32 bit integer within a struct. Each pte_t points to an address of a page frame and all the addresses pointed to are guaranteed to be page aligned. Therefore, there are PAGE_SHIFT (12) bits in that 32 bit value that are free for status bits of the page table entry. A number of the protection and status bits are listed in Table 3.1 but what bits exist and what they mean varies between architectures.
These bits are self-explanatory except for the
_PAGE_PROTNONE
which we will
discuss further. On the x86 with Pentium III and higher, this bit is called the
Attribute Table (PAT)
Page
while earlier architectures such as the Pentium II had this bit
reserved. The PAT bit is used to indicate the size of the page the PTE is referencing.
In a PGD entry, this same bit is instead called the
Page Size Exception (PSE)
bit
so obviously these bits are meant to be used in conjunction.
As Linux does not use the PSE bit for user pages, the PAT bit is free in the
PTE for other purposes.
There is a requirement for having a page resident in
memory but inaccessible to the userspace process such as when a region is protected
3.3 Using Page Table Entries
Bit
36
Function
_PAGE_PRESENT
_PAGE_PROTNONE
_PAGE_RW
_PAGE_USER
_PAGE_DIRTY
_PAGE_ACCESSED
Page is resident in memory and not swapped out
Page is resident but not accessable
Set if the page may be written to
Set if the page is accessible from user space
Set if the page is written to
Set if the page is accessed
Table 3.1: Page Table Entry Protection and Status Bits
There is a requirement for having a page resident in memory but inaccessible to the userspace process, such as when a region is protected with mprotect() with the PROT_NONE flag. When the region is to be protected, the _PAGE_PRESENT bit is cleared and the _PAGE_PROTNONE bit is set. The macro pte_present() checks if either of these bits are set and so the kernel itself knows the PTE is present, just inaccessible to userspace, which is a subtle, but important point. As the hardware bit _PAGE_PRESENT is clear, a page fault will occur if the page is accessed so Linux can enforce the protection while still knowing the page is resident if it needs to swap it out or the process exits.
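The pte_present() check described here reduces to a single bit test against both flags. On the x86 it looks approximately like the following sketch of the <asm-i386/pgtable.h> definition.

#define pte_present(x)  ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))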
3.3 Using Page Table Entries
Macros are defined in <asm/pgtable.h> which are important for the navigation and examination of page table entries. To navigate the page directories, three macros are provided which break up a linear address space into its component parts. pgd_offset() takes an address and the mm_struct for the process and returns the PGD entry that covers the requested address. pmd_offset() takes a PGD entry and an address and returns the relevant PMD. pte_offset() takes a PMD entry and an address and returns the relevant PTE. The remainder of the linear address provided is the offset within the page. The relationship between these fields is illustrated in Figure 3.1.
The second round of macros determine if the page table entries are present or
may be used.
• pte_none(), pmd_none() and pgd_none() return 1 if the corresponding entry does not exist;

• pte_present(), pmd_present() and pgd_present() return 1 if the corresponding page table entries have the PRESENT bit set;

• pte_clear(), pmd_clear() and pgd_clear() will clear the corresponding page table entry;

• pmd_bad() and pgd_bad() are used to check entries when passed as input parameters to functions that may change the value of the entries. Whether it returns 1 varies between the few architectures that define these macros but for those that actually define it, making sure the page entry is marked as present and accessed are the two most important checks.
There are many parts of the VM which are littered with page table walk code and it is important to recognise it. A very simple example of a page table walk is the function follow_page() in mm/memory.c. The following is an excerpt from that function with the parts unrelated to the page table walk omitted:

pgd_t *pgd;
pmd_t *pmd;
pte_t *ptep, pte;

pgd = pgd_offset(mm, address);
if (pgd_none(*pgd) || pgd_bad(*pgd))
        goto out;

pmd = pmd_offset(pgd, address);
if (pmd_none(*pmd) || pmd_bad(*pmd))
        goto out;

ptep = pte_offset(pmd, address);
if (!ptep)
        goto out;

pte = *ptep;

It simply uses the three offset macros to navigate the page tables and the _none() and _bad() macros to make sure it is looking at a valid page table.
The third set of macros examine and set the permissions of an entry. The permissions determine what a userspace process can and cannot do with a particular page. For example, the kernel page table entries are never readable by a userspace process.

• The read permissions for an entry are tested with pte_read(), set with pte_mkread() and cleared with pte_rdprotect();

• The write permissions are tested with pte_write(), set with pte_mkwrite() and cleared with pte_wrprotect();

• The execute permissions are tested with pte_exec(), set with pte_mkexec() and cleared with pte_exprotect(). It is worth noting that with the x86 architecture, there is no means of setting execute permissions on pages so these three macros act the same way as the read macros;

• The permissions can be modified to a new value with pte_modify() but its use is almost non-existent. It is only used in the function change_pte_range() in mm/mprotect.c.
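As a small illustration of how these macros combine, the sketch below write-protects an entry that is currently present and writable. It is not taken from the kernel; ptep is assumed to point at a valid, suitably locked page table entry and set_pte() is covered in the next section.

if (pte_present(*ptep) && pte_write(*ptep))
        set_pte(ptep, pte_wrprotect(*ptep));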
The fourth set of macros examine and set the state of an entry. There are only two bits that are important in Linux, the dirty bit and the accessed bit. To check these bits, the macros pte_dirty() and pte_young() are used. To set the bits, the macros pte_mkdirty() and pte_mkyoung() are used. To clear them, the macros pte_mkclean() and pte_old() are available.
3.4 Translating and Setting Page Table Entries
This set of functions and macros deal with the mapping of addresses and pages to PTEs and the setting of the individual entries.

The macro mk_pte() takes a struct page and protection bits and combines them together to form the pte_t that needs to be inserted into the page table. A similar macro mk_pte_phys() exists which takes a physical page address as a parameter.

The macro pte_page() returns the struct page which corresponds to the PTE entry. pmd_page() returns the struct page containing the set of PTEs.

The macro set_pte() takes a pte_t such as that returned by mk_pte() and places it within the process's page tables. pte_clear() is the reverse operation. An additional function is provided called ptep_get_and_clear() which clears an entry from the process page table and returns the pte_t. This is important when some modification needs to be made to either the PTE protection or the struct page itself.
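The following sketch, which is not kernel code, shows how the navigation macros from Section 3.3 and the macros above fit together to insert a single mapping. It assumes the PMD and PTE levels have already been allocated and that the caller holds the appropriate page table lock.

static void example_establish_pte(struct mm_struct *mm, unsigned long address,
                                  struct page *page, pgprot_t prot)
{
        pgd_t *pgd = pgd_offset(mm, address);
        pmd_t *pmd = pmd_offset(pgd, address);
        pte_t *ptep = pte_offset(pmd, address);

        set_pte(ptep, mk_pte(page, prot));
}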
3.5 Allocating and Freeing Page Tables
The last set of functions deal with the allocation and freeing of page tables. Page
tables, as stated, are physical pages containing an array of entries and the allocation
and freeing of physical pages is a relatively expensive operation, both in terms of
time and the fact that interrupts are disabled during page allocation. The allocation
and deletion of page tables, at any of the three levels, is a very frequent operation
so it is important the operation is as quick as possible.
Hence the pages used for the page tables are cached in a number of different lists called quicklists. Each architecture implements these caches differently but the principles used are the same. For example, not all architectures cache PGDs because the allocation and freeing of them only happens during process creation and exit. As both of these are very expensive operations, the allocation of another page is negligible.
PGDs, PMDs and PTEs have two sets of functions each for the allocation and freeing of page tables. The allocation functions are pgd_alloc(), pmd_alloc() and pte_alloc() respectively and the free functions are, predictably enough, called pgd_free(), pmd_free() and pte_free().

Broadly speaking, the three implement caching with the use of three caches called pgd_quicklist, pmd_quicklist and pte_quicklist. Architectures implement these three lists in different ways but one method is through the use of a
LIFO type structure. Ordinarily, a page table entry contains pointers to other pages containing page tables or data. While cached, the first element of the list is used to point to the next free page table. During allocation, one page is popped off the list and during free, one is placed as the new head of the list. A count is kept of how many pages are used in the cache.
The quick allocation function from the pgd_quicklist is not externally defined outside of the architecture although get_pgd_fast() is a common choice for the function name. The cached allocation functions for PMDs and PTEs are publicly defined as pmd_alloc_one_fast() and pte_alloc_one_fast().

If a page is not available from the cache, a page will be allocated using the physical page allocator (see Chapter 6). The functions for the three levels of page tables are get_pgd_slow(), pmd_alloc_one() and pte_alloc_one().
Obviously a large number of pages may exist on these caches and so there is a mechanism in place for pruning them. Each time the caches grow or shrink, a counter is incremented or decremented and it has a high and low watermark. check_pgt_cache() is called in two places to check these watermarks. When the high watermark is reached, entries from the cache will be freed until the cache size returns to the low watermark. The function is called after clear_page_tables() when a large number of page tables have potentially been freed and it is also called by the system idle task.
3.6 Kernel Page Tables
When the system first starts, paging is not enabled as page tables do not magically initialise themselves. Each architecture implements this differently so only the x86 case will be discussed. The page table initialisation is divided into two phases. The bootstrap phase sets up page tables for just 8MiB so the paging unit can be enabled. The second phase initialises the rest of the page tables. We discuss both of these phases below.
3.6.1 Bootstrapping
The assembler function startup_32() is responsible for enabling the paging unit in arch/i386/kernel/head.S. While all normal kernel code in vmlinuz is compiled with the base address at PAGE_OFFSET + 1MiB, the kernel is actually loaded beginning at the first megabyte (0x00100000) of memory. The first megabyte is used by some devices for communication with the BIOS and is skipped. The bootstrap code in this file treats 1MiB as its base address by subtracting __PAGE_OFFSET from any address until the paging unit is enabled, so before the paging unit is enabled, a page table mapping has to be established which translates the 8MiB of physical memory to the virtual address PAGE_OFFSET.

Initialisation begins with statically defining at compile time an array called swapper_pg_dir which is placed using linker directives at 0x00101000. It then establishes page table entries for 2 pages, pg0 and pg1. If the processor supports the Page Size Extension (PSE) bit, it will be set so that pages will be translated as 4MiB pages, not 4KiB as is the normal case. The first pointers to pg0 and pg1 are placed to cover the region 1-9MiB; the second pointers to pg0 and pg1 are placed at PAGE_OFFSET+1MiB. This means that when paging is enabled, they will map to the correct pages using either physical or virtual addressing for just the kernel image. The rest of the kernel page tables will be initialised by paging_init().

Once this mapping has been established, the paging unit is turned on by setting a bit in the cr0 register and a jump takes place immediately to ensure the Instruction Pointer (EIP register) is correct.
3.6.2 Finalising
The function responsible for finalising the page tables is called paging_init(). The call graph for this function on the x86 can be seen in Figure 3.4.

Figure 3.4: Call Graph: paging_init()

The function first calls pagetable_init() to initialise the page tables necessary to reference all physical memory in ZONE_DMA and ZONE_NORMAL. Remember that high memory in ZONE_HIGHMEM cannot be directly referenced and mappings are set up for it temporarily. For each pgd_t used by the kernel, the boot memory allocator (see Chapter 5) is called to allocate a page for the PMDs and the PSE bit will be set if available to use 4MiB TLB entries instead of 4KiB. If the PSE bit is not supported, a page for PTEs will be allocated for each pmd_t. If the CPU supports the PGE flag, it also will be set so that the page table entry will be global and visible to all processes.

Next, pagetable_init() calls fixrange_init() to setup the fixed address space mappings at the end of the virtual address space starting at FIXADDR_START. These mappings are used for purposes such as the local APIC and the atomic kmappings between FIX_KMAP_BEGIN and FIX_KMAP_END required by kmap_atomic(). Finally, the function calls fixrange_init() to initialise the page table entries required for normal high memory mappings with kmap().
Once pagetable_init() returns, the page tables for kernel space are now fully initialised so the static PGD (swapper_pg_dir) is loaded into the CR3 register so that the static table is now being used by the paging unit.

The next task of paging_init() is to call kmap_init() to initialise each of the PTEs with the PAGE_KERNEL protection flags. The final task is to call zone_sizes_init() which initialises all the zone structures used.
3.7 Mapping addresses to a struct page

There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical address. Linux achieves this by knowing where, in both virtual and physical memory, the global mem_map array is, as the global array has pointers to all struct pages representing physical memory in the system. All architectures achieve this with very similar mechanisms but for illustration purposes, we will only examine the x86 carefully. This section will first discuss how physical addresses are mapped to kernel virtual addresses and then what this means to the mem_map array.
3.7.1 Mapping Physical to Virtual Kernel Addresses
As we saw in Section 3.6, Linux sets up a direct mapping from the physical address 0 to the virtual address PAGE_OFFSET at 3GiB on the x86. This means that any virtual address can be translated to the physical address by simply subtracting PAGE_OFFSET, which is essentially what the function virt_to_phys() with the macro __pa() does:
/* from <asm-i386/page.h> */
#define __pa(x)         ((unsigned long)(x)-PAGE_OFFSET)

/* from <asm-i386/io.h> */
static inline unsigned long virt_to_phys(volatile void * address)
{
        return __pa(address);
}
Obviously the reverse operation involves simply adding PAGE_OFFSET which is carried out by the function phys_to_virt() with the macro __va(). Next we see how this helps the mapping of struct pages to physical addresses.
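For completeness, the reverse macro is a one line addition in the same header; this is quoted from memory and shown only for illustration.

/* from <asm-i386/page.h> */
#define __va(x)         ((void *)((unsigned long)(x)+PAGE_OFFSET))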
3.7.2 Mapping struct pages to Physical Addresses

As we saw in Section 3.6.1, the kernel image is located at the physical address 1MiB, which of course translates to the virtual address PAGE_OFFSET + 0x00100000, and a virtual region totaling about 8MiB is reserved for the image which is the region that can be addressed by two PGDs. This would imply that the first available memory to use is located at 0xC0800000 but that is not the case. Linux tries to reserve the first 16MiB of memory for ZONE_DMA so the first virtual area used for kernel allocations is actually 0xC1000000. This is where the global mem_map is usually located. ZONE_DMA will still get used, but only when absolutely necessary.
struct pages by treating them as an index
PAGE_SHIFT bits to the right will
treat it as a PFN from physical address 0 which is also an index within the mem_map
array. This is exactly what the macro virt_to_page() does which is declared as
follows in <asm-i386/page.h>:
Physical addresses are translated to
into the
mem_map array.
Shifting a physical address
#define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT))
The macro virt_to_page() takes the virtual address kaddr, converts it to the physical address with __pa(), converts it into an array index by bit shifting it right PAGE_SHIFT bits and indexes into the mem_map by simply adding them together. No macro is available for converting struct pages to physical addresses but at this stage, it should be obvious to see how it could be calculated.
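A sketch of the calculation alluded to here: the offset of a struct page within mem_map is its PFN, so shifting it left by PAGE_SHIFT recovers the physical address. The macro name below is made up for illustration only.

#define example_page_to_phys(page) \
        ((unsigned long)((page) - mem_map) << PAGE_SHIFT)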
3.8 Translation Lookaside Buffer (TLB)
Initially, when the processor needs to map a virtual address to a physical address, it must traverse the full page directory searching for the PTE of interest. This would normally imply that each assembly instruction that references memory actually requires several separate memory references for the page table traversal [Tan01]. To avoid this considerable overhead, architectures take advantage of the fact that most processes exhibit a locality of reference or, in other words, large numbers of memory references tend to be for a small number of pages. They take advantage of this reference locality by providing a Translation Lookaside Buffer (TLB) which is a small associative memory that caches virtual to physical page table resolutions.
Linux assumes that most architectures support some type of TLB although the architecture independent code does not care how it works. Instead, architecture dependant hooks are dispersed throughout the VM code at points where it is known that some hardware with a TLB would need to perform a TLB related operation. For example, when the page tables have been updated, such as after a page fault has completed, the processor may need to update the TLB for that virtual address mapping.

Not all architectures require these types of operations but because some do, the hooks have to exist. If the architecture does not require the operation to be performed, the function for that TLB operation will be a null operation that is optimised out at compile time.
A quite large list of TLB API hooks, most of which are declared in <asm/pgtable.h>, are listed in Tables 3.2 and 3.3 and the APIs are quite well documented in the kernel source by Documentation/cachetlb.txt [Mil00]. It is possible to have just one TLB flush function but as both TLB flushes and TLB refills are very expensive operations, unnecessary TLB flushes should be avoided if at all possible. For example, when context switching, Linux will avoid loading new page tables using Lazy TLB Flushing, discussed further in Section 4.3.
void flush_tlb_all(void)
This flushes the entire TLB on all processors running in the system making it the most expensive TLB flush operation. After it completes, all modifications to the page tables will be visible globally. This is required after the kernel page tables, which are global in nature, have been modified such as after vfree() (see Chapter 7) completes or after the PKMap is flushed (see Chapter 9).

void flush_tlb_mm(struct mm_struct *mm)
This flushes all TLB entries related to the userspace portion (i.e. below PAGE_OFFSET) for the requested mm context. In some architectures, such as MIPS, this will need to be performed for all processors but usually it is confined to the local processor. This is only called when an operation has been performed that affects the entire address space, such as after all the address mappings have been duplicated with dup_mmap() for fork or after all memory mappings have been deleted with exit_mmap().

void flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end)
As the name indicates, this flushes all entries within the requested userspace range for the mm context. This is used after a new region has been moved or changed as during mremap() which moves regions or mprotect() which changes the permissions. The function is also indirectly used during unmapping a region with munmap() which calls tlb_finish_mmu() which tries to use flush_tlb_range() intelligently. This API is provided for architectures that can remove ranges of TLB entries quickly rather than iterating with flush_tlb_page().

Table 3.2: Translation Lookaside Buffer Flush API
3.9 Level 1 CPU Cache Management
As Linux manages the CPU Cache in a very similar fashion to the TLB, this section covers how Linux utilises and manages the CPU cache. CPU caches, like TLB caches, take advantage of the fact that programs tend to exhibit a locality of reference [Sea00] [CS98]. To avoid having to fetch data from main memory for each reference, the CPU will instead cache very small amounts of data in the CPU cache. Frequently, there are two levels called the Level 1 and Level 2 CPU caches. The Level 2 CPU caches are larger but slower than the L1 cache but Linux only concerns itself with the Level 1 or L1 cache.
void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
Predictably, this API is responsible for flushing a single page from the TLB. The two most common uses of it are for flushing the TLB after a page has been faulted in or has been paged out.

void flush_tlb_pgtables(struct mm_struct *mm, unsigned long start, unsigned long end)
This API is called when the page tables are being torn down and freed. Some platforms cache the lowest level of the page table, i.e. the actual page frame storing entries, which needs to be flushed when the pages are being deleted. This is called when a region is being unmapped and the page directory entries are being reclaimed.

void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
This API is only called after a page fault completes. It tells the architecture dependant code that a new translation now exists at pte for the virtual address addr. It is up to each architecture how this information should be used. For example, Sparc64 uses the information to decide if the local CPU needs to flush its data cache or does it need to send an IPI to a remote processor.

Table 3.3: Translation Lookaside Buffer Flush API (cont)
CPU caches are organised into lines. Each line is typically quite small, usually 32 bytes, and each line is aligned to its boundary size. In other words, a cache line of 32 bytes will be aligned on a 32 byte address. With Linux, the size of the line is L1_CACHE_BYTES which is defined by each architecture.

How addresses are mapped to cache lines vary between architectures but the mappings come under three headings, direct mapping, associative mapping and set associative mapping. Direct mapping is the simplest approach where each block of memory maps to only one possible cache line. With associative mapping, any block of memory can map to any cache line. Set associative mapping is a hybrid approach where any block of memory can map to any line but only within a subset of the available lines. Regardless of the mapping scheme, they each have one thing in common, addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs simple tricks to try and maximise cache usage:

• Frequently accessed structure fields are at the start of the structure to increase the chance that only one line is needed to address the common fields;

• Unrelated items in a structure should try to be at least cache size bytes apart to avoid false sharing between CPUs;

• Objects in the general caches, such as the mm_struct cache, are aligned to the L1 CPU cache to avoid false sharing.
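The third trick can be seen in how caches are created with the slab allocator (covered in Chapter 8). The sketch below, with a made-up structure and cache name, requests L1 alignment with SLAB_HWCACHE_ALIGN and keeps hot fields at the start of the structure.

#include <linux/slab.h>
#include <linux/cache.h>

struct example_ctx {
        unsigned long hot_counter;        /* frequently accessed, kept first */
        char pad[L1_CACHE_BYTES];         /* keep unrelated fields a line apart */
        unsigned long cold_statistics;
};

static kmem_cache_t *example_cachep;

void example_cache_init(void)
{
        example_cachep = kmem_cache_create("example_ctx",
                                           sizeof(struct example_ctx),
                                           0, SLAB_HWCACHE_ALIGN,
                                           NULL, NULL);
}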
If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high as a reference to cache can typically be performed in less than 10ns where a reference to main memory typically will cost between 100ns and 200ns. The basic objective is then to have as many cache hits and as few cache misses as possible.

Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches. The hooks are placed in locations where the virtual to physical mapping changes, such as during a page table update. The CPU cache flushes should always take place first as some CPUs require a virtual to physical mapping to exist when the virtual address is being flushed from the cache. The three operations that require proper ordering are listed in Table 3.4.
Flushing Full MM           Flushing Range              Flushing Page
flush_cache_mm()           flush_cache_range()         flush_cache_page()
Change all page tables     Change page table range     Change single PTE
flush_tlb_mm()             flush_tlb_range()           flush_tlb_page()

Table 3.4: Cache and TLB Flush Ordering
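In code, the ordering in Table 3.4 for a range operation looks like the sketch below; the page table update in the middle is a placeholder and the surrounding locking is omitted.

flush_cache_range(mm, start, end);
/* ... modify the page table entries covering start to end ... */
flush_tlb_range(mm, start, end);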
The API used for flushing the caches is declared in <asm/pgtable.h> and is listed in Table 3.5. In many respects, it is very similar to the TLB flushing API.

It does not end there though. A second set of interfaces is required to avoid virtual aliasing problems. The problem is that some CPUs select lines based on the virtual address meaning that one physical address can exist on multiple lines leading to cache coherency problems. Architectures with this problem may try and ensure that shared mappings will only use addresses as a stop-gap measure. However, a proper API to address this problem is also supplied which is listed in Table 3.6.
3.10 What's New In 2.6
Most of the mechanics for page table management are essentially the same for 2.6 but the changes that have been introduced are quite wide reaching and the implementations in-depth.

MMU-less Architecture Support A new file has been introduced called mm/nommu.c. This source file contains replacement code for functions that assume the existence of a MMU, like mmap() for example. This is to support architectures, usually microcontrollers, that have no MMU. Much of the work in this area was developed by the uCLinux Project (http://www.uclinux.org).
void flush_cache_all(void)
This flushes the entire CPU cache system making it the most severe flush operation to use. It is used when changes to the kernel page tables, which are global in nature, are to be performed.

void flush_cache_mm(struct mm_struct *mm)
This flushes all entries related to the address space. On completion, no cache lines will be associated with mm.

void flush_cache_range(struct mm_struct *mm, unsigned long start, unsigned long end)
This flushes lines related to a range of addresses in the address space. Like its TLB equivalent, it is provided in case the architecture has an efficient way of flushing ranges instead of flushing each individual page.

void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr)
This is for flushing a single page sized region. The VMA is supplied as the mm_struct is easily accessible via vma→vm_mm. Additionally, by testing for the VM_EXEC flag, the architecture will know if the region is executable for caches that separate the instruction and data caches. VMAs are described further in Chapter 4.

Table 3.5: CPU Cache Flush API
Reverse Mapping The most significant and important change to page table management is the introduction of Reverse Mapping (rmap). Referring to it as rmap is deliberate as it is the common usage of the acronym and should not be confused with the -rmap tree developed by Rik van Riel which has many more alterations to the stock VM than just the reverse mapping.

In a single sentence, rmap grants the ability to locate all PTEs which map a particular page given just the struct page. In 2.4, the only way to find all PTEs which map a shared page, such as a memory mapped shared library, is to linearly search all page tables belonging to all processes. This is far too expensive and Linux tries to avoid the problem by using the swap cache (see Section 11.4). This means that with many shared pages, Linux may have to swap out entire processes regardless of the page age and usage patterns. 2.6 instead has a PTE chain associated with every struct page which may be traversed to remove a page from all page tables that reference it. This way, pages in the LRU can be swapped out in an intelligent manner without resorting to swapping entire processes.

As might be imagined by the reader, the implementation of this simple concept is a little involved. The first step in understanding the implementation is the union pte that is a field in struct page. This union has two fields, a pointer to a struct pte_chain called chain and a pte_addr_t called direct.
void flush_page_to_ram(unsigned long address)
This is a deprecated API which should no longer be used and in fact will be removed totally for 2.6. It is covered here for completeness and because it is still used. The function is called when a new physical page is about to be placed in the address space of a process. It is required to avoid writes from kernel space being invisible to userspace after the mapping occurs.

void flush_dcache_page(struct page *page)
This function is called when the kernel writes to or copies from a page cache page as these are likely to be mapped by multiple processes.

void flush_icache_range(unsigned long address, unsigned long endaddr)
This is called when the kernel stores information in addresses that is likely to be executed, such as when a kernel module has been loaded.

void flush_icache_user_range(struct vm_area_struct *vma, struct page *page, unsigned long addr, int len)
This is similar to flush_icache_range() except it is called when a userspace range is affected. Currently, this is only used for ptrace() (used when debugging) when the address space is being accessed by access_process_vm().

void flush_icache_page(struct vm_area_struct *vma, struct page *page)
This is called when a page-cache page is about to be mapped. It is up to the architecture to use the VMA flags to determine whether the I-Cache or D-Cache should be flushed.

Table 3.6: CPU D-Cache and I-Cache Flush API
The union is an optimisation whereby direct is used to save memory if there is only one PTE mapping the entry, otherwise a chain is used. The type pte_addr_t varies between architectures but whatever its type, it can be used to locate a PTE, so we will treat it as a pte_t for simplicity.

The struct pte_chain is a little more complex. The struct itself is very simple but it is compact with overloaded fields and a lot of development effort has been spent on making it small and efficient. Fortunately, this does not make it indecipherable.

First, it is the responsibility of the slab allocator to allocate and manage struct pte_chains as it is this type of task the slab allocator is best at. Each struct pte_chain can hold up to NRPTE pointers to PTE structures. Once that many PTEs have been filled, a struct pte_chain is allocated and added to the chain.

The struct pte_chain has two fields. The first is unsigned long next_and_idx which has two purposes. When next_and_idx is ANDed with NRPTE, it returns the number of PTEs currently in this struct pte_chain indicating where the next free slot is. When next_and_idx is ANDed with the negation of NRPTE (i.e. ∼NRPTE), a pointer to the next struct pte_chain in the chain is returned¹. This is basically how a PTE chain is implemented.
To give a taste of the rmap intricacies, we'll give an example of what happens when a new PTE needs to map a page. The basic process is to have the caller allocate a new pte_chain with pte_chain_alloc(). This allocated chain is passed with the struct page and the PTE to page_add_rmap(). If the existing PTE chain associated with the page has slots available, it will be used and the pte_chain allocated by the caller returned. If no slots were available, the allocated pte_chain will be added to the chain and NULL returned.

There is a quite substantial API associated with rmap, for tasks such as creating chains and adding and removing PTEs to a chain, but a full listing is beyond the scope of this section. Fortunately, the API is confined to mm/rmap.c and the functions are heavily commented so their purpose is clear.
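A sketch of the caller side pattern just described is shown below. Only pte_chain_alloc() and page_add_rmap() are named in the text; the GFP argument and the pte_chain_free() call are assumptions made for illustration.

struct pte_chain *pte_chain = pte_chain_alloc(GFP_KERNEL);

/* ... set up the PTE at ptep so that it maps page ... */

pte_chain = page_add_rmap(page, ptep, pte_chain);
pte_chain_free(pte_chain);      /* assumed to free an unconsumed chain and to be NULL-safe */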
There are two main benefits, both related to pageout, with the introduction of reverse mapping. The first is with the setup and tear-down of pagetables. As will be seen in Section 11.4, pages being paged out are placed in a swap cache and information is written into the PTE necessary to find the page again. This can lead to multiple minor faults as pages are put into the swap cache and then faulted again by a process. With rmap, the setup and removal of PTEs is atomic. The second major benefit is when pages need to be paged out, finding all PTEs referencing the pages is a simple operation but impractical with 2.4, hence the swap cache.

Reverse mapping is not without its cost though. The first, and obvious one, is the additional space requirements for the PTE chains. Arguably, the second is a CPU cost associated with reverse mapping but it has not been proved to be significant. What is important to note though is that reverse mapping is only a benefit when pageouts are frequent. If the machine's workload does not result in much pageout or memory is ample, reverse mapping is all cost with little or no benefit. At the time of writing, the merits and downsides of rmap are still the subject of a number of discussions.
Object-Based Reverse Mapping The reverse mapping required for each page can have very expensive space requirements. To compound the problem, many of the reverse mapped pages in a VMA will be essentially identical. One way of addressing this is to reverse map based on the VMAs rather than individual pages. That is, instead of having a reverse mapping for each page, all the VMAs which map a particular page would be traversed and the page unmapped from each. Note that objects in this case refers to the VMAs, not an object in the object-orientated sense of the word². At the time of writing, this feature has not been merged yet and was last seen in kernel 2.5.68-mm1 but there is a strong incentive to have it available if the problems with it can be resolved. For the very curious, the patch for just file/device backed objrmap at this release is available³ but it is only for the very very curious reader.

¹ Told you it was compact
² Don't blame me, I didn't name it. In fact the original patch for this feature came with the comment "From Dave. Crappy name"
There are two tasks that require all PTEs that map a page to be traversed. The first task is page_referenced() which checks all PTEs that map a page to see if the page has been referenced recently. The second task is when a page needs to be unmapped from all processes with try_to_unmap(). To complicate matters further, there are two types of mappings that must be reverse mapped, those that are backed by a file or device and those that are anonymous. In both cases, the basic objective is to traverse all VMAs which map a particular page and then walk the page table for that VMA to get the PTE. The only difference is how it is implemented. The case where it is backed by some sort of file is the easiest case and was implemented first so we'll deal with it first. For the purposes of illustrating the implementation, we'll discuss how page_referenced() is implemented.

page_referenced() calls page_referenced_obj() which is the top level function for finding all PTEs within VMAs that map the page. As the page is mapped for a file or device, page→mapping contains a pointer to a valid address_space. The address_space has two linked lists which contain all VMAs which use the mapping with the address_space→i_mmap and address_space→i_mmap_shared fields. For every VMA that is on these linked lists, page_referenced_obj_one() is called with the VMA and the page as parameters. The function page_referenced_obj_one() first checks if the page is in an address managed by this VMA and if so, traverses the page tables of the mm_struct using the VMA (vma→vm_mm) until it finds the PTE mapping the page for that mm_struct.
Anonymous page tracking is a lot trickier and was implemented in a number of stages. It only made a very brief appearance and was removed again in 2.5.65-mm4 as it conflicted with a number of other changes. The first stage in the implementation was to use page→mapping and page→index fields to track mm_struct and address pairs. These fields previously had been used to store a pointer to swapper_space and a pointer to the swp_entry_t (see Chapter 11). Exactly how it is addressed is beyond the scope of this section but the summary is that swp_entry_t is stored in page→private.

try_to_unmap_obj() works in a similar fashion but obviously, all the PTEs that reference a page with this method can do so without needing to reverse map the individual pages. There is a serious search complexity problem that is preventing it being merged. The scenario that describes the problem is as follows:

Take a case where 100 processes have 100 VMAs mapping a single file. To unmap a single page in this case with object-based reverse mapping would require 10,000 VMAs to be searched, most of which are totally unnecessary. With page based reverse mapping, only 100 pte_chain slots need to be examined, one for each process.

An optimisation was introduced to order VMAs in the address_space by virtual address but the search for a single page is still far too expensive for object-based reverse mapping to be merged.

³ ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.68/2.5.68-mm2/experimental
PTEs in High Memory In 2.4, page table entries exist in ZONE_NORMAL as the kernel needs to be able to address them directly during a page table walk. This was acceptable until it was found that, with high memory machines, ZONE_NORMAL was being consumed by the third level page table PTEs. The obvious answer is to move PTEs to high memory which is exactly what 2.6 does.

As we will see in Chapter 9, addressing information in high memory is far from free, so moving PTEs to high memory is a compile time configuration option. In short, the problem is that the kernel must map pages from high memory into the lower address space before they can be used but there is a very limited number of slots available for these mappings introducing a troublesome bottleneck. However, for applications with a large number of PTEs, there is little other option. At time of writing, a proposal has been made for having a User Kernel Virtual Area (UKVA) which would be a region in kernel space private to each process but it is unclear if it will be merged for 2.6 or not.

To take the possibility of high memory mapping into account, the macro pte_offset() from 2.4 has been replaced with pte_offset_map() in 2.6. If PTEs are in low memory, this will behave the same as pte_offset() and return the address of the PTE. If the PTE is in high memory, it will first be mapped into low memory with kmap_atomic() so it can be used by the kernel. This PTE must be unmapped as quickly as possible with pte_unmap().
In programming terms, this means that page table walk code looks slightly different. In particular, to find the PTE for a given address, the code now reads as (taken from mm/memory.c):

ptep = pte_offset_map(pmd, address);
if (!ptep)
        goto out;

pte = *ptep;
pte_unmap(ptep);
Additionally, the PTE allocation API has changed. Instead of pte_alloc(), there is now a pte_alloc_kernel() for use with kernel PTE mappings and pte_alloc_map() for userspace mapping. The principal difference between them is that pte_alloc_kernel() will never use high memory for the PTE.

In memory management terms, the overhead of having to map the PTE from high memory should not be ignored. Only one PTE may be mapped per CPU at a time, although a second may be mapped with pte_offset_map_nested(). This introduces a penalty when all PTEs need to be examined, such as during zap_page_range() when all PTEs in a given range need to be unmapped.

At time of writing, a patch has been submitted which places PMDs in high memory using essentially the same mechanism and API changes. It is likely that it will be merged.
Huge TLB Filesystem Most modern architectures support more than one page size. For example, on many x86 architectures, there is an option to use 4KiB pages or 4MiB pages. Traditionally, Linux only used large pages for mapping the actual kernel image and nowhere else. As TLB slots are a scarce resource, it is desirable to be able to take advantage of the large pages especially on machines with large amounts of physical memory.

In 2.6, Linux allows processes to use huge pages, the size of which is determined by HPAGE_SIZE. The number of available huge pages is determined by the system administrator by using the /proc/sys/vm/nr_hugepages proc interface which ultimately uses the function set_hugetlb_mem_size(). As the success of the allocation depends on the availability of physically contiguous memory, the allocation should be made during system startup.

The root of the implementation is a Huge TLB Filesystem (hugetlbfs) which is a pseudo-filesystem implemented in fs/hugetlbfs/inode.c. Basically, each file in this filesystem is backed by a huge page. During initialisation, init_hugetlbfs_fs() registers the file system and mounts it as an internal filesystem with kern_mount().

There are two ways that huge pages may be accessed by a process. The first is by using shmget() to setup a shared region backed by huge pages and the second is to call mmap() on a file opened in the huge page filesystem.

When a shared memory region should be backed by huge pages, the process should call shmget() and pass SHM_HUGETLB as one of the flags. This results in hugetlb_zero_setup() being called which creates a new file in the root of the internal hugetlb filesystem. The name of the file is determined by an atomic counter called hugetlbfs_counter which is incremented every time a shared region is setup.

To create a file backed by huge pages, a filesystem of type hugetlbfs must first be mounted by the system administrator. Instructions on how to perform this task are detailed in Documentation/vm/hugetlbpage.txt. Once the filesystem is mounted, files can be created as normal with the system call open(). When mmap() is called on the open file, the file_operations struct hugetlbfs_file_operations ensures that hugetlbfs_file_mmap() is called to setup the region properly.

Huge TLB pages have their own functions for the management of page tables, address space operations and filesystem operations. The names of the functions for page table management can all be seen in <linux/hugetlb.h> and they are named very similarly to their normal page equivalents. The implementation of the hugetlb functions is located near their normal page equivalents so they are easy to find.
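To make the two access methods concrete, the userspace sketch below uses both. The mount point /mnt/hugetlbfs, the 4MiB length and the SHM_HUGETLB fallback definition are illustrative assumptions and error handling is omitted.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000                 /* from <linux/shm.h> */
#endif

#define LENGTH (4UL * 1024 * 1024)        /* one huge page on x86 with PSE */

int main(void)
{
        /* Method 1: a SysV shared memory segment backed by huge pages */
        int shmid = shmget(IPC_PRIVATE, LENGTH,
                           SHM_HUGETLB | IPC_CREAT | 0600);
        void *shm = shmat(shmid, NULL, 0);

        /* Method 2: mmap() on a file created in a mounted hugetlbfs */
        int fd = open("/mnt/hugetlbfs/example", O_CREAT | O_RDWR, 0600);
        void *map = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        /* ... use shm and map ... */

        munmap(map, LENGTH);
        close(fd);
        shmdt(shm);
        return 0;
}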
Cache Flush Management The changes here are minimal. The API function flush_page_to_ram() has been totally removed and a new API flush_dcache_range() has been introduced.
Chapter 4
Process Address Space
One of the principal advantages of virtual memory is that each process has its own
virtual address space, which is mapped to physical memory by the operating system.
In this chapter we will discuss the process address space and how Linux manages it.
The kernel treats the userspace portion of the address space very differently to the kernel portion. For example, allocations for the kernel are satisfied immediately and are visible globally no matter what process is on the CPU. vmalloc() is partially an exception as a minor page fault will occur to sync the process page tables with the reference page tables, but the page will still be allocated immediately upon request. With a process, space is simply reserved in the linear address space by pointing a page table entry to a read-only globally visible page filled with zeros. On writing, a page fault is triggered which results in a new page being allocated, filled with zeros, placed in the page table entry and marked writable. It is filled with zeros so that the new page will appear exactly the same as the global zero-filled page.
The userspace portion is not trusted or presumed to be constant. After each context switch, the userspace portion of the linear address space can potentially change except when a Lazy TLB switch is used as discussed later in Section 4.3. As a result of this, the kernel must be prepared to catch all exceptions and addressing errors raised from userspace. This is discussed in Section 4.5.
This chapter begins with how the linear address space is broken up and what
the purpose of each section is. We then cover the structures maintained to describe
each process, how they are allocated, initialised and then destroyed. Next, we will
cover how individual regions within the process space are created and all the various
functions associated with them. That will bring us to exception handling related to
the process address space, page faulting and the various cases that occur to satisfy
a page fault. Finally, we will cover how the kernel safely copies information to and
from userspace.
4.1 Linear Address Space
From a user perspective, the address space is a flat linear address space but predictably, the kernel's perspective is very different. The address space is split into two parts, the userspace part which potentially changes with each full context switch and the kernel address space which remains constant. The location of the split is determined by the value of PAGE_OFFSET which is at 0xC0000000 on the x86. This means that 3GiB is available for the process to use while the remaining 1GiB is always mapped by the kernel. The linear virtual address space as the kernel sees it is illustrated in Figure 4.1.
Figure 4.1: Kernel Address Space
8MiB (the amount of memory addressed by two PGDs) is reserved at PAGE_OFFSET for loading the kernel image to run. 8MiB is simply a reasonable amount of space to reserve for the purposes of loading the kernel image. The kernel image is placed in this reserved space during kernel page table initialisation as discussed in Section 3.6.1. Somewhere shortly after the image, the mem_map for UMA architectures, as discussed in Chapter 2, is stored. The location of the array is usually at the 16MiB mark to avoid using ZONE_DMA but not always. With NUMA architectures, portions of the virtual mem_map will be scattered throughout this region and where they are actually located is architecture dependent.
The region between PAGE_OFFSET and VMALLOC_START - VMALLOC_OFFSET is the physical memory map and the size of the region depends on the amount of available RAM. As we saw in Section 3.6, page table entries exist to map physical memory to the virtual address range beginning at PAGE_OFFSET. Between the physical memory map and the vmalloc address space, there is a gap of space VMALLOC_OFFSET in size, which on the x86 is 8MiB, to guard against out of bounds errors. For illustration, on an x86 with 32MiB of RAM, VMALLOC_START will be located at PAGE_OFFSET + 0x02000000 + 0x00800000.
In low memory systems, the remaining amount of the virtual address space, minus a 2 page gap, is used by vmalloc() for representing non-contiguous memory allocations in a contiguous virtual address space. In high-memory systems, the vmalloc area extends as far as PKMAP_BASE minus the two page gap and two extra regions are introduced. The first, which begins at PKMAP_BASE, is an area reserved for the mapping of high memory pages into low memory with kmap() as discussed in Chapter 9. The second is for fixed virtual address mappings which extend from FIXADDR_START to FIXADDR_TOP. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time such as the Advanced Programmable Interrupt Controller (APIC). FIXADDR_TOP is statically defined to be 0xFFFFE000 on the x86 which is one page before the end of the virtual address space. The size of the fixed mapping region is calculated at compile time in __FIXADDR_SIZE and used to index back from FIXADDR_TOP to give the start of the region FIXADDR_START.
The region required for vmalloc(), kmap() and the fixed virtual address mapping is what limits the size of ZONE_NORMAL. As the running kernel needs these functions, a region of at least VMALLOC_RESERVE will be reserved at the top of the address space. VMALLOC_RESERVE is architecture specific but on the x86, it is defined as 128MiB. This is why ZONE_NORMAL is generally referred to as being only 896MiB in size; it is the 1GiB of the upper portion of the linear address space minus the minimum 128MiB that is reserved for the vmalloc region.
4.2 Managing the Address Space
The address space usable by the process is managed by a high level mm_struct which is roughly analogous to the vmspace struct in BSD [McK96].

Each address space consists of a number of page-aligned regions of memory that are in use. They never overlap and represent a set of addresses which contain pages that are related to each other in terms of protection and purpose. These regions are represented by a struct vm_area_struct and are roughly analogous to the vm_map_entry struct in BSD. For clarity, a region may represent the process heap for use with malloc(), a memory mapped file such as a shared library or a block of anonymous memory allocated with mmap(). The pages for this region may still have to be allocated, be active and resident or have been paged out.

If a region is backed by a file, its vm_file field will be set. By traversing vm_file→f_dentry→d_inode→i_mapping, the associated address_space for the region may be obtained. The address_space has all the filesystem specific information required to perform page-based operations on disk.
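The traversal just described amounts to a few pointer dereferences; the helper below is made up for illustration and simply follows the chain for a file-backed VMA.

struct address_space *example_vma_mapping(struct vm_area_struct *vma)
{
        if (!vma->vm_file)
                return NULL;              /* anonymous region */
        return vma->vm_file->f_dentry->d_inode->i_mapping;
}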
The relationship between the different address space related structures is illustrated in Figure 4.2. A number of system calls are provided which affect the address space and regions. These are listed in Table 4.1.
4.3 Process Address Space Descriptor
The process address space is described by the mm_struct struct meaning that only one exists for each process and is shared between userspace threads. In fact, threads are identified in the task list by finding all task_structs which have pointers to the same mm_struct.
Figure 4.2: Data Structures related to the Address Space
A unique mm_struct is not needed for kernel threads as they will never page fault or access the userspace portion. The only exception is page faulting within the vmalloc space. The page fault handling code treats this as a special case and updates the current page table with information in the master page table. As a mm_struct is not needed for kernel threads, the task_struct→mm field for kernel threads is always NULL. For some tasks such as the boot idle task, the mm_struct is never setup but for kernel threads, a call to daemonize() will call exit_mm() to decrement the usage counter.
As TLB flushes are extremely expensive, especially with architectures such as the PPC, a technique called lazy TLB is employed which avoids unnecessary TLB flushes by processes which do not access the userspace page tables as the kernel portion of the address space is always visible. The call to switch_mm(), which results in a TLB flush, is avoided by borrowing the mm_struct used by the previous task and placing it in task_struct→active_mm. This technique has made large improvements to context switch times.

When entering lazy TLB, the function enter_lazy_tlb() is called to ensure that a mm_struct is not shared between processors in SMP machines, making it a NULL operation on UP machines.
The second use of lazy TLB is during process exit when start_lazy_tlb() is used briefly while the process is waiting to be reaped by the parent.

fork()
Creates a new process with a new address space. All the pages are marked COW and are shared between the two processes until a page fault occurs to make private copies

clone()
clone() allows a new process to be created that shares parts of its context with its parent and is how threading is implemented in Linux. clone() without the CLONE_VM set will create a new address space which is essentially the same as fork()

mmap()
mmap() creates a new region within the process linear address space

mremap()
Remaps or resizes a region of memory. If the virtual address space is not available for the mapping, the region may be moved unless the move is forbidden by the caller.

munmap()
This destroys part or all of a region. If the region being unmapped is in the middle of an existing region, the existing region is split into two separate regions

shmat()
This attaches a shared memory segment to a process address space

shmdt()
Removes a shared memory segment from an address space

execve()
This loads a new executable file replacing the current address space

exit()
Destroys an address space and all regions

Table 4.1: System Calls Related to Memory Regions
The struct has two reference counts called mm_users and mm_count for two types of users. mm_users is a reference count of processes accessing the userspace portion of this mm_struct, such as the page tables and file mappings. Threads and the swap_out() code for instance will increment this count making sure a mm_struct is not destroyed early. When it drops to 0, exit_mmap() will delete all mappings and tear down the page tables before decrementing the mm_count.

mm_count is a reference count of the anonymous users for the mm_struct, initialised at 1 for the real user. An anonymous user is one that does not necessarily care about the userspace portion and is just borrowing the mm_struct. Example users are kernel threads which use lazy TLB switching. When this count drops to 0, the mm_struct can be safely destroyed. Both reference counts exist because anonymous users need the mm_struct to exist even if the userspace mappings get destroyed and there is no point delaying the teardown of the page tables.

The mm_struct is defined in <linux/sched.h> as follows:
struct mm_struct {
        struct vm_area_struct * mmap;
        rb_root_t mm_rb;
        struct vm_area_struct * mmap_cache;
        pgd_t * pgd;
        atomic_t mm_users;
        atomic_t mm_count;
        int map_count;
        struct rw_semaphore mmap_sem;
        spinlock_t page_table_lock;

        struct list_head mmlist;

        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long rss, total_vm, locked_vm;
        unsigned long def_flags;
        unsigned long cpu_vm_mask;
        unsigned long swap_address;

        unsigned dumpable:1;

        /* Architecture-specific MM context */
        mm_context_t context;
};
The meaning of each of the fields in this sizeable struct is as follows:

mmap The head of a linked list of all VMA regions in the address space;

mm_rb The VMAs are arranged in a linked list and in a red-black tree for fast lookups. This is the root of the tree;

mmap_cache The VMA found during the last call to find_vma() is stored in this field on the assumption that the area will be used again soon;

pgd The Page Global Directory for this process;

mm_users A reference count of users accessing the userspace portion of the address space as explained at the beginning of the section;

mm_count A reference count of the anonymous users for the mm_struct starting at 1 for the real user as explained at the beginning of this section;

map_count Number of VMAs in use;

mmap_sem This is a long lived lock which protects the VMA list for readers and writers. As users of this lock require it for a long time and may need to sleep, a spinlock is inappropriate. A reader of the list takes this semaphore with down_read(). If they need to write, it is taken with down_write() and the page_table_lock spinlock is later acquired while the VMA linked lists are being updated;

page_table_lock This protects most fields on the mm_struct. As well as the page tables, it protects the RSS (see below) count and the VMA from modification;

mmlist All mm_structs are linked together via this field;

start_code, end_code The start and end address of the code section;

start_data, end_data The start and end address of the data section;

start_brk, brk The start and end address of the heap;

start_stack Predictably enough, the start of the stack region;

arg_start, arg_end The start and end address of command line arguments;

env_start, env_end The start and end address of environment variables;

rss Resident Set Size (RSS) is the number of resident pages for this process. It should be noted that the global zero page is not accounted for by RSS;

total_vm The total memory space occupied by all VMA regions in the process;

locked_vm The number of resident pages locked in memory;

def_flags Only one possible value, VM_LOCKED. It is used to determine if all future mappings are locked by default or not;

cpu_vm_mask A bitmask representing all possible CPUs in an SMP system. The mask is used by an InterProcessor Interrupt (IPI) to determine if a processor should execute a particular function or not. This is important during TLB flush for each CPU;

swap_address Used by the pageout daemon to record the last address that was swapped from when swapping out entire processes;

dumpable Set by prctl(), this flag is important only when tracing a process;

context Architecture specific MMU context.
There are a small number of functions for dealing with mm_structs. They are described in Table 4.2.
Function          Description
mm_init()         Initialises a mm_struct by setting starting values for each
                  field, allocating a PGD, initialising spinlocks etc.
allocate_mm()     Allocates a mm_struct from the slab allocator
mm_alloc()        Allocates a mm_struct using allocate_mm() and calls
                  mm_init() to initialise it
exit_mmap()       Walks through a mm_struct and unmaps all VMAs associated
                  with it
copy_mm()         Makes an exact copy of the current task's mm_struct for a
                  new task. This is only used during fork
free_mm()         Returns the mm_struct to the slab allocator

Table 4.2: Functions related to memory region descriptors
4.3.1 Allocating a Descriptor
Two functions are provided to allocate a mm_struct. To be slightly confusing, they are essentially the same but with small important differences. allocate_mm() is just a preprocessor macro which allocates a mm_struct from the slab allocator (see Chapter 8). mm_alloc() allocates from slab and then calls mm_init() to initialise it.
4.3.2 Initialising a Descriptor
The initial mm_struct in the system is called init_mm and is statically initialised at compile time using the macro INIT_MM().

#define INIT_MM(name) \
{                                                            \
        mm_rb:           RB_ROOT,                            \
        pgd:             swapper_pg_dir,                     \
        mm_users:        ATOMIC_INIT(2),                     \
        mm_count:        ATOMIC_INIT(1),                     \
        mmap_sem:        __RWSEM_INITIALIZER(name.mmap_sem), \
        page_table_lock: SPIN_LOCK_UNLOCKED,                 \
        mmlist:          LIST_HEAD_INIT(name.mmlist),        \
}

Once it is established, new mm_structs are created using their parent mm_struct as a template. The function responsible for the copy operation is copy_mm() and it uses init_mm() to initialise process specific fields.
4.3.3 Destroying a Descriptor
While a new user increments the usage count with atomic_inc(&mm->mm_users), it is decremented with a call to mmput(). If the mm_users count reaches zero, all the mapped regions are destroyed with exit_mmap() and the page tables are destroyed, as there are no longer any users of the userspace portions. The mm_count count is decremented with mmdrop(), as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count reaches zero, the mm_struct will be destroyed.
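The reference counting described above follows a simple pattern. The sketch below is purely illustrative and is not taken from any single kernel function; it only uses the calls already named in this section:

    /* Becoming a user of the userspace portion of an mm */
    atomic_inc(&mm->mm_users);
    /* ... work with the VMAs and page tables ... */
    mmput(mm);        /* tears down the VMAs if mm_users reaches zero */

    /* Pinning only the mm_struct itself, an "anonymous" user */
    atomic_inc(&mm->mm_count);
    /* ... access fields of the mm_struct ... */
    mmdrop(mm);       /* frees the mm_struct when mm_count reaches zero */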
4.4 Memory Regions
The full address space of a process is rarely used; only sparse regions are. Each region is represented by a vm_area_struct. Regions never overlap and each represents a set of addresses with the same protection and purpose. Examples of a region include a read-only shared library loaded into the address space or the process heap. A full list of the mapped regions a process has may be viewed via the proc interface at /proc/PID/maps where PID is the process ID of the process that is to be examined.
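As an illustration, the output of /proc/PID/maps for a small program looks something like the following. The addresses, device numbers, inode numbers and file names here are fabricated for the example and will differ from system to system:

    08048000-0804f000 r-xp 00000000 03:01 12345      /bin/cat
    0804f000-08050000 rw-p 00006000 03:01 12345      /bin/cat
    40000000-40015000 r-xp 00000000 03:01 54321      /lib/ld-2.3.2.so
    40016000-40120000 r-xp 00000000 03:01 54322      /lib/libc-2.3.2.so
    bfffe000-c0000000 rwxp 00000000 00:00 0

Each line corresponds to one vm_area_struct, showing its address range, permissions, file offset and, for file-backed regions, the backing file.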
The region may have a number of different structures associated with it as illustrated in Figure 4.2. At the top, there is the vm_area_struct which on its own is enough to represent anonymous memory.

If the region is backed by a file, the struct file is available through the vm_file field which has a pointer to the struct inode. The inode is used to get the struct address_space which has all the private information about the file, including a set of pointers to filesystem functions which perform the filesystem specific operations such as reading and writing pages to disk.

The struct vm_area_struct is declared as follows in <linux/mm.h>:
 44 struct vm_area_struct {
 45     struct mm_struct * vm_mm;
 46     unsigned long vm_start;
 47     unsigned long vm_end;
 49
 50     /* linked list of VM areas per task, sorted by address */
 51     struct vm_area_struct *vm_next;
 52
 53     pgprot_t vm_page_prot;
 54     unsigned long vm_flags;
 55
 56     rb_node_t vm_rb;
 57
 63     struct vm_area_struct *vm_next_share;
 64     struct vm_area_struct **vm_pprev_share;
 65
 66     /* Function pointers to deal with this struct. */
 67     struct vm_operations_struct * vm_ops;
 68
 69     /* Information about our backing store: */
 70     unsigned long vm_pgoff;
 72     struct file * vm_file;
 73     unsigned long vm_raend;
 74     void * vm_private_data;
 75 };
vm_mm The mm_struct this VMA belongs to;
vm_start The starting address of the region;
vm_end The end address of the region;
vm_next All the VMAs in an address space are linked together in an address-ordered singly linked list via this field. It is interesting to note that the VMA list is one of the very rare cases where a singly linked list is used in the kernel;

vm_page_prot The protection flags that are set for each PTE in this VMA. The different bits are described in Table 3.1;

vm_flags A set of flags describing the protections and properties of the VMA. They are all defined in <linux/mm.h> and are described in Figure 4.3;

vm_rb As well as being in a linked list, all the VMAs are stored on a red-black tree for fast lookups. This is important for page fault handling when finding the correct region quickly is important, especially for a large number of mapped regions;
vm_next_share Shared VMA regions based on file mappings (such as shared libraries) are linked together with this field;

vm_pprev_share The complement of vm_next_share;

vm_ops The vm_ops field contains function pointers for open(), close() and nopage(). These are needed for syncing with information from the disk;

vm_pgoff This is the page aligned offset within a file that is memory mapped;

vm_file The struct file pointer to the file being mapped;

vm_raend This is the end address of a read-ahead window. When a fault occurs, a number of additional pages after the desired page will be paged in. This field determines how many additional pages are faulted in;

vm_private_data Used by some device drivers to store private information. Not of concern to the memory manager.
All the regions are linked together on a linked list ordered by address via the vm_next field. When searching for a free area, it is a simple matter of traversing the list, but a frequent operation is to search for the VMA for a particular address, such as during page faulting for example. In this case, the red-black tree is traversed as it has O(log N) search time on average. The tree is ordered so that lower addresses than the current node are on the left leaf and higher addresses are on the right.
4.4.1 Memory Region Operations
There are three operations which a VMA may support, called open(), close() and nopage(). It supports these with a vm_operations_struct in the VMA called vma→vm_ops. The struct contains three function pointers and is declared as follows in <linux/mm.h>:

133 struct vm_operations_struct {
134     void (*open)(struct vm_area_struct * area);
135     void (*close)(struct vm_area_struct * area);
136     struct page * (*nopage)(struct vm_area_struct * area,
                                unsigned long address,
                                int unused);
137 };
The open() and close() functions will be called every time a region is created or deleted. These functions are only used by a small number of devices, one filesystem and the System V shared regions which need to perform additional operations when regions are opened or closed. For example, the System V open() callback will increment the number of VMAs using a shared segment (shp→shm_nattch).

The main operation of interest is the nopage() callback. This callback is used during a page-fault by do_no_page(). The callback is responsible for locating the page in the page cache, or allocating a page and populating it with the required data, before returning it.
Protection Flags

VM_READ          Pages may be read
VM_WRITE         Pages may be written
VM_EXEC          Pages may be executed
VM_SHARED        Pages may be shared
VM_DONTCOPY      VMA will not be copied on fork
VM_DONTEXPAND    Prevents a region being resized. Flag is unused

mmap Related Flags

VM_MAYREAD       Allow the VM_READ flag to be set
VM_MAYWRITE      Allow the VM_WRITE flag to be set
VM_MAYEXEC       Allow the VM_EXEC flag to be set
VM_MAYSHARE      Allow the VM_SHARE flag to be set
VM_GROWSDOWN     Shared segment (probably stack) may grow down
VM_GROWSUP       Shared segment (probably heap) may grow up
VM_SHM           Pages are used by shared SHM memory segment
VM_DENYWRITE     What MAP_DENYWRITE for mmap() translates to. Now unused
VM_EXECUTABLE    What MAP_EXECUTABLE for mmap() translates to. Now unused
VM_STACK_FLAGS   Flags used by setup_arg_flags() to setup the stack

Locking Flags

VM_LOCKED        If set, the pages will not be swapped out. Set by mlock()
VM_IO            Signals that the area is a mmaped region for IO to a
                 device. It will also prevent the region being core dumped
VM_RESERVED      Do not swap out this region, used by device drivers

madvise() Flags

VM_SEQ_READ      A hint that pages will be accessed sequentially
VM_RAND_READ     A hint stating that readahead in the region is useless

Figure 4.3: Memory Region Flags
Most files that are mapped will use a generic vm_operations_struct called generic_file_vm_ops. It registers only a nopage() function called filemap_nopage(). This nopage() function will either locate the page in the page cache or read the information from disk. The struct is declared as follows in mm/filemap.c:

2243 static struct vm_operations_struct generic_file_vm_ops = {
2244     nopage:     filemap_nopage,
2245 };
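As an illustration of how a driver might use this interface, the following hedged sketch registers its own nopage() callback. The names mydev_nopage, mydev_lookup_page and mydev_vm_ops are invented for the example; only the nopage() prototype is taken from the declaration shown earlier:

    static struct page *mydev_nopage(struct vm_area_struct *area,
                                     unsigned long address,
                                     int unused)
    {
            struct page *page;

            /* A real driver would locate the page for this address,
             * typically using (address - area->vm_start) and
             * area->vm_pgoff to index its internal buffer. */
            page = mydev_lookup_page(area, address);
            if (page != NULL)
                    get_page(page);   /* take a reference before returning */
            return page;
    }

    static struct vm_operations_struct mydev_vm_ops = {
            nopage:   mydev_nopage,
    };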
4.4.2 File/Device backed memory regions
In the event the region is backed by a file, the vm_file leads to an associated address_space as shown in Figure 4.2. The struct contains information of relevance to the filesystem, such as the number of dirty pages which must be flushed to disk. It is declared as follows in <linux/fs.h>:
406 struct address_space {
407     struct list_head        clean_pages;
408     struct list_head        dirty_pages;
409     struct list_head        locked_pages;
410     unsigned long           nrpages;
411     struct address_space_operations *a_ops;
412     struct inode            *host;
413     struct vm_area_struct   *i_mmap;
414     struct vm_area_struct   *i_mmap_shared;
415     spinlock_t              i_shared_lock;
416     int                     gfp_mask;
417 };
A brief description of each field is as follows:

clean_pages List of clean pages that need no synchronisation with backing storage;

dirty_pages List of dirty pages that need synchronisation with backing storage;

locked_pages List of pages that are locked in memory;

nrpages Number of resident pages in use by the address space;

a_ops A struct of functions for manipulating the filesystem. Each filesystem provides its own address_space_operations although they sometimes use generic functions;

host The host inode the file belongs to;
i_mmap A list of private mappings using this address_space;

i_mmap_shared A list of VMAs which share mappings in this address_space;

i_shared_lock A spinlock to protect this structure;

gfp_mask The mask to use when calling __alloc_pages() for new pages.

Periodically the memory manager will need to flush information to disk. The memory manager does not know and does not care how information is written to disk, so the a_ops struct is used to call the relevant functions. It is declared as follows in <linux/fs.h>:
385 struct address_space_operations {
386     int (*writepage)(struct page *);
387     int (*readpage)(struct file *, struct page *);
388     int (*sync_page)(struct page *);
389     /*
390      * ext3 requires that a successful prepare_write() call be
391      * followed by a commit_write() call - they must be balanced
392      */
393     int (*prepare_write)(struct file *, struct page *,
                             unsigned, unsigned);
394     int (*commit_write)(struct file *, struct page *,
                            unsigned, unsigned);
395     /* Unfortunately this kludge is needed for FIBMAP.
      * Don't use it */
396     int (*bmap)(struct address_space *, long);
397     int (*flushpage) (struct page *, unsigned long);
398     int (*releasepage) (struct page *, int);
399 #define KERNEL_HAS_O_DIRECT
400     int (*direct_IO)(int, struct inode *, struct kiobuf *,
                         unsigned long, int);
401 #define KERNEL_HAS_DIRECT_FILEIO
402     int (*direct_fileIO)(int, struct file *, struct kiobuf *,
                             unsigned long, int);
403     void (*removepage)(struct page *);
404 };
These fields are all function pointers which are described as follows:

writepage Write a page to disk. The offset within the file to write to is stored within the page struct. It is up to the filesystem specific code to find the block. See buffer.c:block_write_full_page();

readpage Read a page from disk. See buffer.c:block_read_full_page();

sync_page Sync a dirty page with disk. See buffer.c:block_sync_page();

prepare_write This is called before data is copied from userspace into a page that will be written to disk. With a journaled filesystem, this ensures the filesystem log is up to date. With normal filesystems, it makes sure the needed buffer pages are allocated. See buffer.c:block_prepare_write();

commit_write After the data has been copied from userspace, this function is called to commit the information to disk. See buffer.c:block_commit_write();

bmap Maps a block so that raw IO can be performed. Mainly of concern to filesystem specific code although it is also used when swapping out pages that are backed by a swap file instead of a swap partition;

flushpage This makes sure there is no IO pending on a page before releasing it. See buffer.c:discard_bh_page();

releasepage This tries to flush all the buffers associated with a page before freeing the page itself. See try_to_free_buffers();

direct_IO This function is used when performing direct IO to an inode. The #define exists so that external modules can determine at compile-time if the function is available, as it was only introduced in 2.4.21;

direct_fileIO Used to perform direct IO with a struct file. Again, the #define exists for external modules as this API was only introduced in 2.4.22;

removepage An optional callback that is used when a page is removed from the page cache in remove_page_from_inode_queue().
4.4.3 Creating A Memory Region
The system call mmap() is provided for creating new memory regions within a process. For the x86, the function sys_mmap2() calls do_mmap2() directly with the same parameters. do_mmap2() is responsible for acquiring the parameters needed by do_mmap_pgoff(), which is the principal function for creating new areas for all architectures.

do_mmap2() first clears the MAP_DENYWRITE and MAP_EXECUTABLE bits from the flags parameter as they are ignored by Linux, which is confirmed by the mmap() manual page. If a file is being mapped, do_mmap2() will look up the struct file based on the file descriptor passed as a parameter and acquire the mm_struct→mmap_sem semaphore before calling do_mmap_pgoff().

do_mmap_pgoff() begins by performing some basic sanity checks. It first checks that the appropriate filesystem or device functions are available if a file or device is being mapped. It then ensures the size of the mapping is page aligned and that it does not attempt to create a mapping in the kernel portion of the address space. It then makes sure the size of the mapping does not overflow the range of pgoff and finally that the process does not have too many mapped regions already.
Figure 4.4: Call Graph: sys_mmap2()

The rest of the function is large but, broadly speaking, it takes the following steps:

• Sanity check the parameters;

• Find a free linear address space large enough for the memory mapping. If a filesystem or device specific get_unmapped_area() function is provided, it will be used; otherwise arch_get_unmapped_area() is called;

• Calculate the VM flags and check them against the file access permissions;

• If an old area exists where the mapping is to take place, fix it up so that it is suitable for the new mapping;

• Allocate a vm_area_struct from the slab allocator and fill in its entries;

• Link in the new VMA;

• Call the filesystem or device specific mmap function;

• Update statistics and exit.
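From userspace, all of the above is hidden behind a single library call. The fragment below is a minimal sketch of mapping a file read-only; the file name is arbitrary and the file is assumed to exist and be non-empty:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
            struct stat st;
            void *addr;
            int fd = open("/etc/hosts", O_RDONLY);    /* arbitrary file */

            if (fd < 0 || fstat(fd, &st) < 0) {
                    perror("open/fstat");
                    exit(1);
            }

            /* The kernel chooses the address; the request is handled by
             * do_mmap_pgoff() as described above. */
            addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }
            printf("file mapped at %p\n", addr);
            munmap(addr, st.st_size);
            close(fd);
            return 0;
    }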
4.4.4 Finding a Mapped Memory Region
A common operation is to find the VMA a particular address belongs to, such as during operations like page faulting, and the function responsible for this is find_vma(). find_vma() and the other API functions affecting memory regions are listed in Table 4.3.

find_vma() first checks the mmap_cache field which caches the result of the last call to find_vma(), as it is quite likely the same region will be needed a few times in succession. If it is not the desired region, the red-black tree stored in the mm_rb field is traversed. If the desired address is not contained within any VMA, the function will return the VMA closest to the requested address, so it is important callers double check to ensure the returned VMA contains the desired address.
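A minimal sketch of the check a caller has to make is shown below. It is written in the style of the fault handling code but is not copied from it; the error handling is only indicative:

    struct vm_area_struct *vma;

    vma = find_vma(mm, address);
    if (!vma || address < vma->vm_start) {
            /* Either no VMA ends above the address, or the closest VMA
             * starts above it: the address is not mapped. */
            goto bad_area;          /* illustrative error path */
    }
    /* Here vma->vm_start <= address < vma->vm_end */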
A second function called find_vma_prev() is provided which is functionally the same as find_vma() except that it also returns a pointer to the VMA preceding the desired VMA, which is required as the list is a singly linked list. find_vma_prev() is rarely used but, notably, it is used when two VMAs are being compared to determine if they may be merged. It is also used when removing a memory region so that the singly linked list may be updated.
The last function of note for searching VMAs is find_vma_intersection() which is used to find a VMA which overlaps a given address range. The most notable use of this is during a call to do_brk() when a region is growing up. It is important to ensure that the growing region will not overlap an old region.
4.4.5 Finding a Free Memory Region
When a new area is to be memory mapped, a free region has to be found that is large enough to contain the new mapping. The function responsible for finding a free area is get_unmapped_area().

As the call graph in Figure 4.5 indicates, there is little work involved with finding an unmapped area. The function is passed a number of parameters. A struct file is passed representing the file or device to be mapped, as well as pgoff which is the offset within the file that is being mapped. The requested address for the mapping is passed, as well as its length. The last parameter is the protection flags for the area.

Figure 4.5: Call Graph: get_unmapped_area()

If a device is being mapped, such as a video card, the associated f_op→get_unmapped_area() is used. This is because devices or files may have additional requirements for mapping that generic code can not be aware of, such as the address having to be aligned to a particular virtual address.

If there are no special requirements, the architecture specific function arch_get_unmapped_area() is called. Not all architectures provide their own function. For those that don't, there is a generic version provided in mm/mmap.c.
struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr)
  Finds the VMA that covers a given address. If the region does not exist, it returns the VMA closest to the requested address

struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr, struct vm_area_struct **pprev)
  Same as find_vma() except it also gives the VMA preceding the returned VMA. It is not often used, with sys_mprotect() being the notable exception, as it is usually find_vma_prepare() that is required

struct vm_area_struct * find_vma_prepare(struct mm_struct * mm, unsigned long addr, struct vm_area_struct ** pprev, rb_node_t *** rb_link, rb_node_t ** rb_parent)
  Same as find_vma() except that it will also give the preceding VMA in the linked list as well as the red-black tree nodes needed to perform an insertion into the tree

struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
  Returns the VMA which intersects a given address range. Useful when checking if a linear address region is in use by any VMA

int vma_merge(struct mm_struct * mm, struct vm_area_struct * prev, rb_node_t * rb_parent, unsigned long addr, unsigned long end, unsigned long vm_flags)
  Attempts to expand the supplied VMA to cover a new address range. If the VMA can not be expanded forwards, the next VMA is checked to see if it may be expanded backwards to cover the address range instead. Regions may be merged if there is no file/device mapping and the permissions match

unsigned long get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags)
  Returns the address of a free region of memory large enough to cover the requested size of memory. Used principally when a new VMA is to be created

void insert_vm_struct(struct mm_struct *, struct vm_area_struct *)
  Inserts a new VMA into a linear address space

Table 4.3: Memory Region VMA API
4.4.6 Inserting a memory region
The principal function for inserting a new memory region is insert_vm_struct() whose call graph can be seen in Figure 4.6. It is a very simple function which first calls find_vma_prepare() to find the appropriate VMAs the new region is to be inserted between and the correct nodes within the red-black tree. It then calls __vma_link() to do the work of linking in the new VMA.

Figure 4.6: Call Graph: insert_vm_struct()

The function insert_vm_struct() is rarely used as it does not increase the map_count field. Instead, the function commonly used is __insert_vm_struct() which performs the same tasks except that it increments map_count.

Two varieties of linking functions are provided, vma_link() and __vma_link(). vma_link() is intended for use when no locks are held. It will acquire all the necessary locks, including locking the file if the VMA is a file mapping, before calling __vma_link() which places the VMA in the relevant lists.

It is important to note that many functions do not use the insert_vm_struct() functions but instead prefer to call find_vma_prepare() themselves followed by a later vma_link() to avoid having to traverse the tree multiple times.

The linking in __vma_link() consists of three stages which are contained in three separate functions. __vma_link_list() inserts the VMA into the linear, singly linked list. If it is the first mapping in the address space (i.e. prev is NULL), it will become the red-black tree root node. The second stage is linking the node into the red-black tree with __vma_link_rb(). The final stage is fixing up the file share mapping with __vma_link_file() which basically inserts the VMA into the linked list of VMAs via the vm_pprev_share and vm_next_share fields.
4.4.7 Merging contiguous regions

Linux used to have a function called merge_segments() [Hac02] which was responsible for merging adjacent regions of memory together if the file and permissions matched. The objective was to reduce the number of VMAs required, especially as many operations resulted in a number of mappings being created, such as calls to sys_mprotect(). This was an expensive operation as it could result in large portions of the mappings being traversed and it was later removed as applications, especially those with many mappings, spent a long time in merge_segments().

The equivalent function which exists now is called vma_merge() and it is only used in two places. The first user is sys_mmap() which calls it if an anonymous region is being mapped, as anonymous regions are frequently mergeable. The second is during do_brk() which is expanding one region into a newly allocated one where the two regions should be merged. Rather than merging two regions, the function vma_merge() checks if an existing region may be expanded to satisfy the new allocation, negating the need to create a new region. A region may be expanded if there are no file or device mappings and the permissions of the two areas are the same.

Regions are merged elsewhere, although no function is explicitly called to perform the merging. The first case is during a call to sys_mprotect(), during the fixup of areas, where the two regions will be merged if the two sets of permissions are the same after the permissions in the affected region change. The second is during a call to move_vma() when it is likely that similar regions will be located beside each other.
4.4.8 Remapping and moving a memory region

mremap() is a system call provided to grow or shrink an existing memory mapping. This is implemented by the function sys_mremap() which may move a memory region if it is growing or it would overlap another region, and MREMAP_FIXED is not specified in the flags. The call graph is illustrated in Figure 4.7.

Figure 4.7: Call Graph: sys_mremap()

If a region is to be moved, do_mremap() first calls get_unmapped_area() to find a region large enough to contain the new resized mapping and then calls move_vma() to move the old VMA to the new location.

Figure 4.8: Call Graph: move_vma()

See Figure 4.8 for the call graph to move_vma(). First, move_vma() checks if the new location may be merged with the VMAs adjacent to the new location. If they can not be merged, a new VMA is allocated. Next move_page_tables() is called (see Figure 4.9 for its call graph) which copies all the page table entries from the old mapping to the new one, literally one PTE at a time. While there may be better ways to move the page tables, this method makes error recovery trivial as backtracking is relatively straightforward.

The contents of the pages are not copied. Instead, zap_page_range() is called to swap out or remove all the pages from the old mapping, and the normal page fault handling code will swap the pages back in from backing storage or from files, or will call the device specific do_nopage() function.
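From userspace, the operation looks like the following sketch which grows an anonymous mapping and allows the kernel to move it. It assumes a Linux system with the MREMAP_MAYMOVE flag available:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t old_size = 4096, new_size = 16 * 4096;
            void *old, *new;

            old = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (old == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }

            /* MREMAP_MAYMOVE lets sys_mremap() relocate the region with
             * move_vma() if it cannot be grown in place. */
            new = mremap(old, old_size, new_size, MREMAP_MAYMOVE);
            if (new == MAP_FAILED) {
                    perror("mremap");
                    exit(1);
            }
            printf("region now at %p (was %p)\n", new, old);
            munmap(new, new_size);
            return 0;
    }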
Figure 4.9: Call Graph: move_page_tables()

4.4.9 Locking a Memory Region

Linux can lock pages from an address range into memory via the system call mlock() which is implemented by sys_mlock() whose call graph is shown in Figure 4.10. At a high level, the function is simple; it creates a VMA for the address range to be locked, sets the VM_LOCKED flag on it and forces all the pages to be present with make_pages_present(). A second system call, mlockall(), which maps to sys_mlockall(), is also provided and is a simple extension which does the same work as sys_mlock() except for every VMA of the calling process. Both functions rely on the core function do_mlock() to perform the real work of finding the affected VMAs and deciding what function is needed to fix up the regions as described later.

There are some limitations to what memory may be locked. The address range must be page aligned as VMAs are page aligned. This is addressed by simply rounding the range up to the nearest page aligned range. The second proviso is that the process limit RLIMIT_MEMLOCK imposed by the system administrator may not be exceeded. The last proviso is that each process may only lock half of physical memory at a time. This is a bit non-functional as there is nothing to stop a process forking a number of times and each child locking a portion, but as only root processes are allowed to lock pages, it does not make much difference. It is safe to presume that a root process is trusted and knows what it is doing. If it does not, the system administrator with the resulting broken system probably deserves it and gets to keep both parts of it.
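For completeness, a userspace view of the interface is sketched below. It assumes the process has the privilege to lock memory (root on 2.4 kernels, or an adequate RLIMIT_MEMLOCK):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4 * 4096;
            char *buf = malloc(len);

            if (buf == NULL)
                    exit(1);

            /* sys_mlock() rounds the range to page boundaries, sets
             * VM_LOCKED on the affected VMAs and faults the pages in
             * with make_pages_present(). */
            if (mlock(buf, len) != 0) {
                    perror("mlock");
                    exit(1);
            }
            memset(buf, 0, len);   /* will not cause major page faults */
            munlock(buf, len);
            free(buf);
            return 0;
    }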
4.4.10 Unlocking the region
The system calls munlock() and munlockall() provide the corollary for the locking functions and map to sys_munlock() and sys_munlockall() respectively. The functions are much simpler than the locking functions as they do not have to make numerous checks. They both rely on the same do_mlock() function to fix up the regions.
4.4.11 Fixing up regions after locking
When locking or unlocking, VMAs will be affected in one of four ways, each of which must be fixed up by mlock_fixup(). The locking may affect the whole VMA, in which case mlock_fixup_all() is called. The second condition, handled by mlock_fixup_start(), is where the start of the region is locked, requiring that a new VMA be allocated to map the new area. The third condition, handled by mlock_fixup_end(), is predictably enough where the end of the region is locked. Finally, mlock_fixup_middle() handles the case where the middle of a region is mapped, requiring two new VMAs to be allocated.

It is interesting to note that VMAs created as a result of locking are never merged, even when unlocked. It is presumed that processes which lock regions will need to lock the same regions over and over again and it is not worth the processor power to constantly merge and split regions.
Figure 4.10: Call Graph: sys_mlock()
4.4.12 Deleting a memory region
The function responsible for deleting memory regions, or parts thereof, is do_munmap(). It is a relatively simple operation in comparison to the other memory region related operations and is basically divided up into three parts. The first is to fix up the red-black tree for the region that is about to be unmapped. The second is to release the pages and PTEs related to the region to be unmapped and the third is to fix up the regions if a hole has been generated.

Figure 4.11: Call Graph: do_munmap()

To ensure the red-black tree is ordered correctly, all VMAs to be affected by the unmap are placed on a linked list called free and then deleted from the red-black tree with rb_erase(). The regions, if they still exist, will be added with their new addresses later during the fixup.

Next, the linked list of VMAs on free is walked through and checked to ensure it is not a partial unmapping. Even if a region is just to be partially unmapped, remove_shared_vm_struct() is still called to remove the shared file mapping. Again, if this is a partial unmapping, it will be recreated during fixup. zap_page_range() is called to remove all the pages associated with the region about to be unmapped before unmap_fixup() is called to handle partial unmappings.

Lastly, free_pgtables() is called to try and free up all the page table entries associated with the unmapped region. It is important to note that the page table entry freeing is not exhaustive. It will only unmap full PGD directories and their entries so, for example, if only half a PGD was used for the mapping, no page table entries will be freed. This is because a finer grained freeing of page table entries would be too expensive to free up data structures that are both small and likely to be used again.
4.4.13 Deleting all memory regions
During process exit, it is necessary to unmap all VMAs associated with a mm_struct. The function responsible is exit_mmap(). It is a very simple function which flushes the CPU cache before walking through the linked list of VMAs, unmapping each of them in turn and freeing up the associated pages, before flushing the TLB and deleting the page table entries. It is covered in detail in the Code Commentary.
4.5 Exception Handling
A very important part of the VM is how kernel address space exceptions that are not bugs are caught¹. This section does not cover the exceptions that are raised with errors such as divide by zero; we are only concerned with the exception raised as the result of a page fault. There are two situations where a bad reference may occur. The first is where a process sends an invalid pointer to the kernel via a system call, which the kernel must be able to safely trap as the only check made initially is that the address is below PAGE_OFFSET. The second is where the kernel uses copy_from_user() or copy_to_user() to read or write data from userspace.

At compile time, the linker creates an exception table in the __ex_table section of the kernel code segment which starts at __start___ex_table and ends at __stop___ex_table. Each entry is of type exception_table_entry which is a pair consisting of an execution point and a fixup routine. When an exception occurs that the page fault handler cannot manage, it calls search_exception_table() to see if a fixup routine has been provided for an error at the faulting instruction. If module support is compiled in, each module's exception table will also be searched.

If the address of the current exception is found in the table, the corresponding location of the fixup code is returned and executed. We will see in Section 4.7 how this is used to trap bad reads and writes to userspace.

¹ Many thanks go to Ingo Oeser for clearing up the details of how this is implemented.
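To make the pairing concrete, the following simplified sketch shows the shape of a table entry and of a lookup. The struct layout matches the description above; the search function here is an invented, linear version for illustration, whereas the real kernel keeps the table sorted and searches it more efficiently:

    struct exception_table_entry {
            unsigned long insn;     /* address of the instruction that may fault */
            unsigned long fixup;    /* address of the fixup code to jump to */
    };

    /* Illustrative only: scan one table for a fixup matching the EIP at
     * which the fault occurred.  Returns 0 if no fixup is registered,
     * meaning the fault is a genuine bug. */
    static unsigned long find_fixup(const struct exception_table_entry *first,
                                    const struct exception_table_entry *last,
                                    unsigned long faulting_eip)
    {
            const struct exception_table_entry *entry;

            for (entry = first; entry <= last; entry++)
                    if (entry->insn == faulting_eip)
                            return entry->fixup;
            return 0;
    }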
4.6 Page Faulting
Pages in the process linear address space are not necessarily resident in memory. For example, allocations made on behalf of a process are not satisfied immediately as the space is just reserved within the vm_area_struct. Other examples of non-resident pages include the page having been swapped out to backing storage or a write to a read-only page.

Linux, like most operating systems, has a Demand Fetch policy as its fetch policy for dealing with pages that are not resident. This states that the page is only fetched from backing storage when the hardware raises a page fault exception which the operating system traps and allocates a page. The characteristics of backing storage imply that some sort of page prefetching policy would result in fewer page faults [MM87] but Linux is fairly primitive in this respect. When a page is paged in from swap space, a number of pages after it, up to 2^page_cluster, are read in by swapin_readahead() and placed in the swap cache. Unfortunately there is only a chance that pages likely to be used soon will be adjacent in the swap area, making it a poor prepaging policy. Linux would likely benefit from a prepaging policy that adapts to program behaviour [KMC02].

There are two types of page fault, major and minor faults. Major page faults occur when data has to be read from disk, which is an expensive operation; otherwise the fault is referred to as a minor, or soft, page fault. Linux maintains statistics on the number of these types of page faults with the task_struct→maj_flt and task_struct→min_flt fields respectively.

The page fault handler in Linux is expected to recognise and act on a number of different types of page faults listed in Table 4.4 which will be discussed in detail later in this chapter.

Each architecture registers an architecture-specific function for the handling of page faults. While the name of this function is arbitrary, a common choice is do_page_fault() whose call graph for the x86 is shown in Figure 4.12.

Figure 4.12: Call Graph: do_page_fault()

This function is provided with a wealth of information such as the address of the fault, whether the page was simply not found or was a protection error, whether it was a read or write fault and whether it is a fault from user or kernel space. It is responsible for determining which type of fault has occurred and how it should be handled by the architecture-independent code. The flow chart in Figure 4.13 shows broadly speaking what this function does. In the figure, identifiers with a colon after them correspond to a label as shown in the code.

Figure 4.13: do_page_fault() Flow Diagram

handle_mm_fault() is the architecture independent top level function for faulting in a page from backing storage, performing COW and so on. If it returns 1, it was a minor fault, 2 was a major fault, 0 sends a SIGBUS error and any other value invokes the out of memory handler.
4.6.1 Handling a Page Fault
Once the exception handler has decided the fault is a valid page fault in a valid memory region, the architecture-independent function handle_mm_fault(), whose call graph is shown in Figure 4.14, takes over. It allocates the required page table entries if they do not already exist and calls handle_pte_fault().

The reasons for page faulting, the fault type and the action taken are as follows:

Region valid but page not allocated (Minor)
    Allocate a page frame from the physical page allocator.

Region not valid but is beside an expandable region like the stack (Minor)
    Expand the region and allocate a page.

Page swapped out but present in swap cache (Minor)
    Re-establish the page in the process page tables and drop a reference
    to the swap cache.

Page swapped out to backing storage (Major)
    Find where the page is stored using information kept in the PTE and
    read it from disk.

Page write when marked read-only (Minor)
    If the page is a COW page, make a copy of it, mark it writable and
    assign it to the process. If it is in fact a bad write, send a SIGSEGV
    signal.

Region is invalid or process has no permissions to access (Error)
    Send a SIGSEGV signal to the process.

Fault occurred in the kernel portion of the address space (Minor)
    If the fault occurred in the vmalloc area of the address space, the
    current process page tables are updated against the master page table
    held by init_mm. This is the only valid kernel page fault that may
    occur.

Fault occurred in the userspace region while in kernel mode (Error)
    If a fault occurs, it means a kernel system did not copy from userspace
    properly and caused a page fault. This is a kernel bug which is treated
    quite severely.

Table 4.4: Reasons For Page Faulting

Based on the properties of the PTE, one of the handler functions shown in Figure 4.14 will be used. The first stage of the decision is to check if the PTE is marked not present and whether it has been allocated, which is checked with pte_present() and pte_none(). If no PTE has been allocated (pte_none() returned true), do_no_page() is called, which handles Demand Allocation. Otherwise it is a page that has been swapped out to disk and do_swap_page() performs Demand Paging. There is a rare exception where swapped out pages belonging to a virtual file are handled by do_no_page(). This particular case is covered in Section 12.4.

The second option is if the page is being written to. If the PTE is write protected, then do_wp_page() is called as the page is a Copy-On-Write (COW) page. A COW page is one which is shared between multiple processes (usually a parent and child) until a write occurs, after which a private copy is made for the writing process. A COW page is recognised because the VMA for the region is marked writable even though the individual PTE is not. If it is not a COW page, the page is simply marked dirty as it has been written to.

The last option is if the page has been read and is present but a fault still occurred. This can occur with some architectures that do not have a three level page table. In this case, the PTE is simply established and marked young.
4.6.2 Demand Allocation
When a process accesses a page for the very first time, the page has to be allocated and possibly filled with data by the do_no_page() function. If the vm_operations_struct associated with the parent VMA (vma→vm_ops) provides a nopage() function, it is called. This is of importance to a memory mapped device such as a video card, which needs to allocate the page and supply data on access, or to a mapped file which must retrieve its data from backing storage. We will first discuss the case where the faulting page is anonymous as this is the simplest case.

Handling anonymous pages If the vm_area_struct→vm_ops field is not filled in or a nopage() function is not supplied, the function do_anonymous_page() is called to handle an anonymous access. There are only two cases to handle, first time read and first time write. As it is an anonymous page, the first read is an easy case as no data exists. In this case, the system-wide empty_zero_page, which is just a page of zeros, is mapped for the PTE and the PTE is write protected. The write protection is set so that another page fault will occur if the process writes to the page. On the x86, the global zero-filled page is zeroed out in the function mem_init().

If this is the first write to the page, alloc_page() is called to allocate a free page (see Chapter 6) and it is zero filled by clear_user_highpage(). Assuming the page was successfully allocated, the Resident Set Size (RSS) field in the mm_struct will be incremented; flush_page_to_ram() is called as required when a page has been inserted into a userspace process by some architectures to ensure cache coherency. The page is then inserted on the LRU lists so it may be reclaimed later by the page reclaiming code. Finally, the page table entries for the process are updated for the new mapping.
Figure 4.14: Call Graph: handle_mm_fault()

Figure 4.15: Call Graph: do_no_page()

Handling file/device backed pages If backed by a file or device, a nopage() function will be provided within the VMA's vm_operations_struct. In the file-backed case, the function filemap_nopage() is frequently the nopage() function for allocating a page and reading a page-sized amount of data from disk. Pages backed by a virtual file, such as those provided by shmfs, will use the function shmem_nopage() (see Chapter 12). Each device driver provides a different nopage() whose internals are unimportant to us here as long as it returns a valid struct page to use.

On return of the page, a check is made to ensure a page was successfully allocated and appropriate errors returned if not. A check is then made to see if an early COW break should take place. An early COW break will take place if the fault is a write to the page and the VM_SHARED flag is not included in the managing VMA. An early break is a case of allocating a new page and copying the data across before reducing the reference count to the page returned by the nopage() function.

In either case, a check is then made with pte_none() to ensure there is not a PTE already in the page table that is about to be used. It is possible with SMP that two faults would occur for the same page at close to the same time and, as the spinlocks are not held for the full duration of the fault, this check has to be made at the last instant. If there has been no race, the PTE is assigned, statistics updated and the architecture hooks for cache coherency called.
4.6.3 Demand Paging
When a page is swapped out to backing storage, the function do_swap_page() is responsible for reading the page back in, with the exception of virtual files which are covered in Chapter 12. The information needed to find it is stored within the PTE itself and is enough to find the page in swap. As pages may be shared between multiple processes, they can not always be swapped out immediately. Instead, when a page is swapped out, it is placed within the swap cache.

Figure 4.16: Call Graph: do_swap_page()

A shared page can not be swapped out immediately because there is no way of mapping a struct page to the PTEs of each process it is shared between. Searching the page tables of all processes is simply far too expensive. It is worth noting that the late 2.5.x kernels and 2.4.x with a custom patch have what is called Reverse Mapping (RMAP), which is discussed at the end of the chapter.

With the swap cache existing, it is possible that when a fault occurs the page still exists in the swap cache. If it does, the reference count to the page is simply increased, it is placed within the process page tables again and the fault registers as a minor page fault.

If the page exists only on disk, swapin_readahead() is called which reads in the requested page and a number of pages after it. The number of pages read in is determined by the variable page_cluster defined in mm/swap.c. On low memory machines with less than 16MiB of RAM, it is initialised as 2, and as 3 otherwise. The number of pages read in is 2^page_cluster unless a bad or empty swap entry is encountered. This works on the premise that a seek is the most expensive operation in time, so once the seek has completed, the succeeding pages should also be read in.
4.6.4 Copy On Write (COW) Pages

Once upon a time, the full parent address space was duplicated for a child when a process forked. This was an extremely expensive operation as it is possible a significant percentage of the process would have to be swapped in from backing storage. To avoid this considerable overhead, a technique called Copy-On-Write (COW) is employed.

Figure 4.17: Call Graph: do_wp_page()

During fork, the PTEs of the two processes are made read-only so that when a write occurs there will be a page fault. Linux recognises a COW page because even though the PTE is write protected, the controlling VMA shows the region is writable. It uses the function do_wp_page() to handle it by making a copy of the page and assigning it to the writing process. If necessary, a new swap slot will be reserved for the page. With this method, only the page table entries have to be copied during a fork.
4.7 Copying To/From Userspace
It is not safe to access memory in the process address space directly as there is no way to quickly check if the page addressed is resident or not. Linux relies on the MMU to raise exceptions when the address is invalid and on the Page Fault Exception handler to catch the exception and fix it up. In the x86 case, assembler is provided by __copy_user() to trap exceptions where the address is totally useless. The location of the fixup code is found when the function search_exception_table() is called. Linux provides an ample API (mainly macros) for copying data to and from the user address space safely, as shown in Table 4.5.

All the macros map on to assembler functions which all follow similar patterns of implementation, so for illustration purposes we'll just trace how copy_from_user() is implemented on the x86.

If the size of the copy is known at compile time, copy_from_user() calls __constant_copy_from_user(), else __generic_copy_from_user() is used. If the size is known, there are different assembler optimisations to copy data in 1, 2 or 4 byte strides; otherwise the distinction between the two copy functions is not important.
unsigned long copy_from_user(void *to, const void *from, unsigned long n)
  Copies n bytes from the user address space (from) to the kernel address space (to)

unsigned long copy_to_user(void *to, const void *from, unsigned long n)
  Copies n bytes from the kernel address space (from) to the user address space (to)

void copy_user_page(void *to, void *from, unsigned long address)
  This copies data to an anonymous or COW page in userspace. Ports are responsible for avoiding D-cache aliases. They can do this by using a kernel virtual address that would use the same cache lines as the virtual address

void clear_user_page(void *page, unsigned long address)
  Similar to copy_user_page() except it is for zeroing a page

void get_user(void *to, void *from)
  Copies an integer value from userspace (from) to kernel space (to)

void put_user(void *from, void *to)
  Copies an integer value from kernel space (from) to userspace (to)

long strncpy_from_user(char *dst, const char *src, long count)
  Copies a null terminated string of at most count bytes long from userspace (src) to kernel space (dst)

long strlen_user(const char *s, long n)
  Returns the length, upper bound by n, of the userspace string including the terminating NULL

int access_ok(int type, unsigned long addr, unsigned long size)
  Returns non-zero if the userspace block of memory is valid and zero otherwise

Table 4.5: Accessing Process Address Space API
The generic copy function eventually calls the function __copy_user_zeroing() in <asm-i386/uaccess.h> which has three important parts. The first part is the assembler for the actual copying of size number of bytes from userspace. If any page is not resident, a page fault will occur and, if the address is valid, it will get swapped in as normal. The second part is fixup code and the third part is the __ex_table mapping the instructions from the first part to the fixup code in the second part.

These pairings, as described in Section 4.5, record the location of the copy instructions and the location of the fixup code, and are placed in the kernel exception handling table by the linker. If an invalid address is read, the function do_page_fault() will fall through, call search_exception_table(), find the EIP where the faulty read took place and jump to the fixup code, which copies zeros into the remaining kernel space, fixes up registers and returns. In this manner, the kernel can safely access userspace with no expensive checks, letting the MMU hardware handle the exceptions.

All the other functions that access userspace follow a similar pattern.
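As a usage example, a hypothetical device driver's write() method might pull data from userspace as follows. The buffer mydev_buf and the function name are invented; only copy_from_user() and its return convention (the number of bytes left uncopied) come from the API above:

    static char mydev_buf[4096];        /* hypothetical driver buffer */

    static ssize_t mydev_write(struct file *file, const char *ubuf,
                               size_t count, loff_t *ppos)
    {
            if (count > sizeof(mydev_buf))
                    count = sizeof(mydev_buf);

            /* If ubuf points at an unmapped or invalid address, the fixup
             * code described above zero-fills the remainder and a non-zero
             * value is returned. */
            if (copy_from_user(mydev_buf, ubuf, count))
                    return -EFAULT;

            return count;
    }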
4.8 What's New in 2.6
Linear Address Space The linear address space remains essentially the same as 2.4 with no modifications that cannot be easily recognised. The main change is the addition of a new page, usable from userspace, that has been entered into the fixed address virtual mappings. On the x86, this page is located at 0xFFFFF000 and called the vsyscall page. Code is located at this page which provides the optimal method for entering kernel-space from userspace. A userspace program now should use call 0xFFFFF000 instead of the traditional int 0x80 when entering kernel space.
struct mm_struct This struct has not changed significantly. The first change is the addition of a free_area_cache field which is initialised as TASK_UNMAPPED_BASE. This field is used to remember where the first hole is in the linear address space to improve search times. A small number of fields have been added at the end of the struct which are related to core dumping and beyond the scope of this book.

struct vm_area_struct This struct also has not changed significantly. The main difference is that the vm_next_share and vm_pprev_share fields have been replaced with a proper linked list with a new field called simply shared. The vm_raend field has been removed altogether as file readahead is implemented very differently in 2.6. Readahead is mainly managed by a struct file_ra_state stored in struct file→f_ra. How readahead is implemented is described in a lot of detail in mm/readahead.c.
struct address_space The first change is relatively minor. The gfp_mask field has been replaced with a flags field where the first __GFP_BITS_SHIFT bits are used as the gfp_mask and accessed with mapping_gfp_mask(). The remaining bits are used to store the status of asynchronous IO. The two flags that may be set are AS_EIO to indicate an IO error and AS_ENOSPC to indicate the filesystem ran out of space during an asynchronous write.

This struct has a number of significant additions, mainly related to the page cache and file readahead. As the fields are quite unique, we'll introduce them in detail:

page_tree This is a radix tree of all pages in the page cache for this mapping, indexed by the block the data is located on the physical disk. In 2.4, searching the page cache involved traversing a linked list; in 2.6, it is a radix tree lookup which considerably reduces search times. The radix tree is implemented in lib/radix-tree.c;

page_lock A spinlock protecting page_tree;

io_pages When dirty pages are to be written out, they are added to this list before do_writepages() is called. As explained in the comment above mpage_writepages() in fs/mpage.c, pages to be written out are placed on this list to avoid deadlocking by locking a page that is already locked for IO;

dirtied_when This field records, in jiffies, the first time an inode was dirtied. The field determines where the inode is located on the super_block→s_dirty list. This prevents a frequently dirtied inode remaining at the top of the list and starving writeout on other inodes;

backing_dev_info This field records readahead related information. The struct is declared in include/linux/backing-dev.h with comments explaining the fields;

private_list This is a private list available to the address_space. If the helper functions mark_buffer_dirty_inode() and sync_mapping_buffers() are used, this list links buffer_heads via the buffer_head→b_assoc_buffers field;

private_lock This spinlock is available for the address_space. The use of this lock is very convoluted but some of the uses are explained in the long ChangeLog for 2.5.17 (http://lwn.net/2002/0523/a/2.5.17.php3). It is mainly related to protecting lists in other mappings which share buffers with this mapping. The lock would not protect this private_list, but it would protect the private_list of another address_space sharing buffers with this mapping;

assoc_mapping This is the address_space which backs buffers contained in this mapping's private_list;

truncate_count This is incremented when a region is being truncated by the function invalidate_mmap_range(). The counter is examined during page fault by do_no_page() to ensure that a page is not faulted in that was just invalidated.
struct address_space_operations Most of the changes to this struct initially look quite simple but are actually quite involved. The changed fields are:

writepage The writepage() callback has been changed to take an additional parameter, struct writeback_control. This struct is responsible for recording information about the writeback, such as whether it is congested or not and whether the writer is the page allocator for direct reclaim or kupdated, and it contains a handle to the backing backing_dev_info to control readahead;

writepages Moves all pages from dirty_pages to io_pages before writing them all out;

set_page_dirty This is an address_space specific method of dirtying a page. It is mainly used by the backing storage address_space_operations and for anonymous shared pages where there are no buffers associated with the page to be dirtied;

readpages Used when reading in pages so that readahead can be accurately controlled;

bmap This has been changed to deal with disk sectors rather than unsigned longs for devices larger than 2^32 bytes;

invalidatepage This is a renaming change. block_flushpage() and the flushpage() callback have been renamed to block_invalidatepage() and invalidatepage() respectively;

direct_IO This has been changed to use the new IO mechanisms in 2.6. The new mechanisms are beyond the scope of this book.
Memory Regions The operation of mmap() has two important changes. The first is that it is possible for security modules to register a callback. This callback is called security_file_mmap() which looks up a security_ops struct for the relevant function. By default, this will be a NULL operation.

The second is that there is much stricter address space accounting code in place. vm_area_structs which are to be accounted will have the VM_ACCOUNT flag set, which will be all userspace mappings. When userspace regions are created or destroyed, the functions vm_acct_memory() and vm_unacct_memory() update the variable vm_committed_space. This gives the kernel a much better view of how much memory has been committed to userspace.
4GiB/4GiB User/Kernel Split One limitation that exists for the 2.4.x kernels is that the kernel has only 1GiB of virtual address space available, which is visible to all processes. At the time of writing, a patch has been developed by Ingo Molnar² which allows the kernel to optionally have its own full 4GiB address space. The patches are available from http://redhat.com/~mingo/4g-patches/ and are included in the -mm test trees but it is unclear if it will be merged into the mainstream or not.

This feature is intended for 32 bit systems that have very large amounts (> 16GiB) of RAM. The traditional 3/1 split adequately supports up to 1GiB of RAM. After that, high-memory support allows larger amounts to be supported by temporarily mapping high-memory pages, but with more RAM this forms a significant bottleneck. For example, as the amount of physical RAM approaches the 60GiB range, almost all of low memory is consumed by mem_map. By giving the kernel its own 4GiB virtual address space, it is much easier to support the memory, but the serious penalty is that there is a per-syscall TLB flush which heavily impacts performance.

With the patch, there is only a small 16MiB region of memory shared between userspace and kernelspace which is used to store the GDT, IDT, TSS, LDT, vsyscall page and the kernel stack. The code for doing the actual switch between the page tables is then contained in the trampoline code for entering/exiting kernelspace. There are a few changes made to the core, such as the removal of direct pointers for accessing userspace buffers but, by and large, the core kernel is unaffected by this patch.

² See http://lwn.net/Articles/39283/ for the first announcement of the patch.
Non-Linear VMA Population In 2.4, a VMA backed by a file is populated in a linear fashion. This can be optionally changed in 2.6 with the introduction of the MAP_POPULATE flag to mmap() and the new system call remap_file_pages(), implemented by sys_remap_file_pages(). This system call allows arbitrary pages in an existing VMA to be remapped to an arbitrary location on the backing file by manipulating the page tables.

On page-out, the non-linear address for the file is encoded within the PTE so that it can be installed again correctly on page fault. How it is encoded is architecture specific, so two macros, pgoff_to_pte() and pte_to_pgoff(), are defined for the task.
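A hedged userspace sketch of the interface is shown below. It assumes a kernel and C library that provide remap_file_pages(); the file name is arbitrary and the file is assumed to be at least four pages long:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            int fd = open("datafile", O_RDONLY);     /* arbitrary file */
            char *map;

            if (fd < 0)
                    exit(1);

            /* A shared mapping of the first four pages of the file. */
            map = mmap(NULL, 4 * page, PROT_READ, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED)
                    exit(1);

            /* Make the first page of the window show page 3 of the file;
             * only the page tables change, the VMA itself is untouched. */
            if (remap_file_pages(map, page, 0, 3, 0) != 0)
                    perror("remap_file_pages");

            munmap(map, 4 * page);
            close(fd);
            return 0;
    }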
This feature is largely of benefit to applications with a large number of mappings, such as database servers and virtualising applications such as emulators. It was introduced for a number of reasons. First, VMAs are per-process and can have considerable space requirements, especially for applications with a large number of mappings. Second, the search get_unmapped_area() uses for finding a free area in the virtual address space is a linear search which is very expensive for large numbers of mappings. Third, non-linear mappings will prefault most of the pages into memory whereas normal mappings may cause a major fault for each page, although this can be avoided by using the MAP_POPULATE flag with mmap() or by using mlock(). The last reason is to avoid sparse mappings which, in the worst case, would require one VMA for every file page mapped.
However, this feature is not without some serious drawbacks. The first is that the system calls truncate() and mincore() are broken with respect to non-linear mappings. Both system calls depend on vm_area_struct→vm_pgoff which is meaningless for non-linear mappings. If a file mapped by a non-linear mapping is truncated, the pages that exist within the VMA will still remain. It has been proposed that the proper solution is to leave the pages in memory but make them anonymous, but at the time of writing no solution has been implemented.

The second major drawback is TLB invalidations. Each remapped page will require that the MMU be told the remapping took place with flush_icache_page(), but the more important penalty is with the call to flush_tlb_page(). Some processors are able to invalidate just the TLB entries related to the page but other processors implement this by flushing the entire TLB. If re-mappings are frequent, the performance will degrade due to increased TLB misses and the overhead of constantly entering kernel space. In some ways, these penalties are the worst as the impact is heavily processor dependent.

It is currently unclear what the future of this feature, if it remains, will be. At the time of writing, there are still ongoing arguments on how the issues with the feature will be fixed, but it is likely that non-linear mappings are going to be treated very differently to normal mappings with respect to pageout, truncation and the reverse mapping of pages. As the main user of this feature is likely to be databases, this special treatment is not likely to be a problem.
Page Faulting

The changes to the page faulting routines are more cosmetic than anything else, other than the necessary changes to support reverse mapping and PTEs in high memory. The main cosmetic change is that the page faulting routines return self-explanatory compile time definitions rather than magic numbers. The possible return values for handle_mm_fault() are VM_FAULT_MINOR, VM_FAULT_MAJOR, VM_FAULT_SIGBUS and VM_FAULT_OOM.
Chapter 5
Boot Memory Allocator
It is impractical to statically initialise all the core kernel memory structures at compile time as there are simply far too many permutations of hardware configurations. Yet to set up even the basic structures requires memory as even the physical page allocator, discussed in the next chapter, needs to allocate memory to initialise itself. But how can the physical page allocator allocate memory to initialise itself?

To address this, a specialised allocator called the Boot Memory Allocator is used. It is based on the most basic of allocators, a First Fit allocator which uses a bitmap to represent memory [Tan01] instead of linked lists of free blocks. If a bit is 1, the page is allocated and 0 if unallocated. To satisfy allocations of sizes smaller than a page, the allocator records the Page Frame Number (PFN) of the last allocation and the offset the allocation ended at. Subsequent small allocations are merged together and stored on the same page.
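The following standalone sketch (not the kernel's code) illustrates the first-fit search over such a bitmap, where a set bit means the page frame is allocated:

#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return non-zero if page frame pfn is marked allocated in the bitmap. */
static int page_allocated(const unsigned long *map, unsigned long pfn)
{
        return (map[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1;
}

/* First fit: find the start of a run of nr free pages, or return -1. */
static long first_fit(const unsigned long *map, unsigned long total_pfns,
                      unsigned long nr)
{
        unsigned long start, i;

        for (start = 0; start + nr <= total_pfns; start++) {
                for (i = 0; i < nr; i++)
                        if (page_allocated(map, start + i))
                                break;
                if (i == nr)
                        return start;
        }
        return -1;
}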
The reader may ask why this allocator is not used for the running system. One compelling reason is that although the first fit allocator does not suffer badly from fragmentation [JW98], memory frequently has to be linearly searched to satisfy an allocation. As this means examining bitmaps, it gets very expensive, especially as the first fit algorithm tends to leave many small free blocks at the beginning of physical memory which still get scanned for large allocations, thus making the process very wasteful [WJNB95].
There are two very similar but distinct APIs for the allocator. One is for UMA architectures, listed in Table 5.1, and the other is for NUMA, listed in Table 5.2. The principal difference is that the NUMA API must be supplied with the node affected by the operation but, as the callers of these APIs exist in the architecture dependent layer, it is not a significant problem.
This chapter will begin with a description of the structure the allocator uses to describe the physical memory available for each node. We will then illustrate how the limits of physical memory and the sizes of each zone are discovered before talking about how the information is used to initialise the boot memory allocator structures. The allocation and free routines will then be discussed before finally talking about how the boot memory allocator is retired.
unsigned long init_bootmem(unsigned long start, unsigned long page)
  This initialises the memory between 0 and the PFN page. The beginning of usable memory is at the PFN start

void reserve_bootmem(unsigned long addr, unsigned long size)
  Mark the pages between the address addr and addr+size reserved. Requests to partially reserve a page will result in the full page being reserved

void free_bootmem(unsigned long addr, unsigned long size)
  Mark the pages between the address addr and addr+size free

void * alloc_bootmem(unsigned long size)
  Allocate size number of bytes from ZONE_NORMAL. The allocation will be aligned to the L1 hardware cache to get the maximum benefit from the hardware cache

void * alloc_bootmem_low(unsigned long size)
  Allocate size number of bytes from ZONE_DMA. The allocation will be aligned to the L1 hardware cache

void * alloc_bootmem_pages(unsigned long size)
  Allocate size number of bytes from ZONE_NORMAL aligned on a page size so that full pages will be returned to the caller

void * alloc_bootmem_low_pages(unsigned long size)
  Allocate size number of bytes from ZONE_DMA aligned on a page size so that full pages will be returned to the caller

unsigned long bootmem_bootmap_pages(unsigned long pages)
  Calculate the number of pages required to store a bitmap representing the allocation state of pages number of pages

unsigned long free_all_bootmem()
  Used at the boot allocator end of life. It cycles through all pages in the bitmap. For each one that is free, the flags are cleared and the page is freed to the physical page allocator (see next chapter) so the runtime allocator can set up its free lists

Table 5.1: Boot Memory Allocator API for UMA Architectures
5.1 Representing the Boot Map

A bootmem_data struct exists for each node of memory in the system. It contains the information needed for the boot memory allocator to allocate memory for a node, such as the bitmap representing allocated pages and where the memory is located. It is declared as follows in <linux/bootmem.h>:
unsigned long init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn)
  For use with NUMA architectures. It initialises the memory between the PFNs startpfn and endpfn with the first usable PFN at freepfn. Once initialised, the pgdat node is inserted into the pgdat_list

void reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr, unsigned long size)
  Mark the pages between the address physaddr and physaddr+size on the specified node pgdat reserved. Requests to partially reserve a page will result in the full page being reserved

void free_bootmem_node(pg_data_t *pgdat, unsigned long physaddr, unsigned long size)
  Mark the pages between the address physaddr and physaddr+size on the specified node pgdat free

void * alloc_bootmem_node(pg_data_t *pgdat, unsigned long size)
  Allocate size number of bytes from ZONE_NORMAL on the specified node pgdat. The allocation will be aligned to the L1 hardware cache to get the maximum benefit from the hardware cache

void * alloc_bootmem_pages_node(pg_data_t *pgdat, unsigned long size)
  Allocate size number of bytes from ZONE_NORMAL on the specified node pgdat aligned on a page size so that full pages will be returned to the caller

void * alloc_bootmem_low_pages_node(pg_data_t *pgdat, unsigned long size)
  Allocate size number of bytes from ZONE_DMA on the specified node pgdat aligned on a page size so that full pages will be returned to the caller

unsigned long free_all_bootmem_node(pg_data_t *pgdat)
  Used at the boot allocator end of life. It cycles through all pages in the bitmap for the specified node. For each one that is free, the page flags are cleared and the page is freed to the physical page allocator (see next chapter) so the runtime allocator can set up its free lists

Table 5.2: Boot Memory Allocator API for NUMA Architectures
25 typedef struct bootmem_data {
26         unsigned long node_boot_start;
27         unsigned long node_low_pfn;
28         void *node_bootmem_map;
29         unsigned long last_offset;
30         unsigned long last_pos;
31 } bootmem_data_t;
The fields of this struct are as follows:

node_boot_start This is the starting physical address of the represented block;

node_low_pfn This is the end physical address, in other words, the end of the ZONE_NORMAL this node represents;

node_bootmem_map This is the location of the bitmap representing allocated or free pages with each bit;

last_offset This is the offset within the page of the end of the last allocation. If 0, the page used is full;

last_pos This is the PFN of the page used with the last allocation. Using this with the last_offset field, a test can be made to see if allocations can be merged with the page used for the last allocation rather than using up a full new page.
5.2 Initialising the Boot Memory Allocator

Each architecture is required to supply a setup_arch() function which, among other tasks, is responsible for acquiring the necessary parameters to initialise the boot memory allocator.

Each architecture has its own function to get the necessary parameters. On the x86, it is called setup_memory(), as discussed in Section 2.2.2, but on other architectures such as MIPS or Sparc, it is called bootmem_init() or, in the case of the PPC, do_init_bootmem(). Regardless of the architecture, the tasks are essentially the same. The parameters it calculates are:

min_low_pfn This is the lowest PFN that is available in the system;

max_low_pfn This is the highest PFN that may be addressed by low memory (ZONE_NORMAL);

highstart_pfn This is the PFN of the beginning of high memory (ZONE_HIGHMEM);

highend_pfn This is the last PFN in high memory;

max_pfn Finally, this is the last PFN available to the system.
5.2.1 Initialising bootmem_data
Once the limits of usable physical memory are discovered by setup_memory(), one of two boot memory initialisation functions is selected and provided with the start and end PFN for the node to be initialised. init_bootmem(), which initialises contig_page_data, is used by UMA architectures, while init_bootmem_node() is for NUMA to initialise a specified node. Both functions are trivial and rely on init_bootmem_core() to do the real work.

The first task of the core function is to insert this pg_data_t into the pgdat_list as, at the end of this function, the node is ready for use. It then records the starting and end address for this node in its associated bootmem_data_t and allocates the bitmap representing page allocations. The size in bytes, hence the division by 8, of the bitmap required is calculated as

mapsize = ((end_pfn − start_pfn) + 7) / 8
The bitmap is stored at the physical address pointed to by bootmem_data_t→node_boot_start and the virtual address of the map is placed in bootmem_data_t→node_bootmem_map. As there is no architecture-independent way to detect holes in memory, the entire bitmap is initialised to 1, effectively marking all pages allocated. It is up to the architecture dependent code to set the bits of usable pages to 0 although, in reality, the Sparc architecture is the only one which uses this bitmap. In the case of the x86, the function register_bootmem_low_pages() reads through the e820 map and calls free_bootmem() for each usable page to set the bit to 0 before using reserve_bootmem() to reserve the pages needed by the actual bitmap.
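As a simple worked example of the calculation above (a sketch rather than the kernel function): a node with 131072 page frames, which is 512MiB of 4KiB pages, needs (131072 + 7) / 8 = 16384 bytes of bitmap, or four pages.

/* One bit per page frame, rounded up to a whole number of bytes. */
static unsigned long bootmap_bytes(unsigned long start_pfn,
                                   unsigned long end_pfn)
{
        return ((end_pfn - start_pfn) + 7) / 8;
}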
5.3 Allocating Memory

The reserve_bootmem() function may be used to reserve pages for use by the caller but is very cumbersome to use for general allocations. There are four functions provided for easy allocations on UMA architectures called alloc_bootmem(), alloc_bootmem_low(), alloc_bootmem_pages() and alloc_bootmem_low_pages() which are fully described in Table 5.1. All of these macros call __alloc_bootmem() with different parameters. The call graph for these functions is shown in Figure 5.1.

Figure 5.1: Call Graph: alloc_bootmem()
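As a hedged example of how this API is typically used during boot (the table name and entry structure here are invented for illustration; note that in 2.4, __alloc_bootmem() panics rather than returning NULL if the request cannot be satisfied):

#include <linux/bootmem.h>
#include <linux/init.h>

struct my_entry {                       /* invented structure */
        unsigned long key;
        unsigned long value;
};

static struct my_entry *my_table;

void __init my_table_init(unsigned long nr_entries)
{
        /* L1 cache aligned memory from ZONE_NORMAL. */
        my_table = alloc_bootmem(nr_entries * sizeof(struct my_entry));

        /* alloc_bootmem_pages() would be used for page-aligned data and
         * alloc_bootmem_low() for memory that must come from ZONE_DMA. */
}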
Similar functions exist for NUMA which take the node as an additional parameter, as listed in Table 5.2. They are called alloc_bootmem_node(), alloc_bootmem_pages_node() and alloc_bootmem_low_pages_node(). All of these macros call __alloc_bootmem_node() with different parameters.

The parameters to either __alloc_bootmem() or __alloc_bootmem_node() are essentially the same. They are
pgdat This is the node to allocate from. It is omitted in the UMA case as it is assumed to be contig_page_data;

size This is the size in bytes of the requested allocation;

align This is the number of bytes that the request should be aligned to. For small allocations, they are aligned to SMP_CACHE_BYTES which, on the x86, will align to the L1 hardware cache;

goal This is the preferred starting address to begin allocating from. The low functions will start from physical address 0 whereas the others will begin from MAX_DMA_ADDRESS, which is the maximum address DMA transfers may be made from on this architecture.
The core function for all the allocation APIs is __alloc_bootmem_core(). It is a large function but with simple steps that can be broken down. The function linearly scans memory starting from the goal address for a block of memory large enough to satisfy the allocation. With the API, this address will either be 0 for DMA-friendly allocations or MAX_DMA_ADDRESS otherwise.
The clever part, and the main bulk of the function, deals with deciding if this new allocation can be merged with the previous one. It may be merged if the following conditions hold:

• The page used for the previous allocation (bootmem_data→last_pos) is adjacent to the page found for this allocation;

• The previous page has some free space in it (bootmem_data→last_offset != 0);

• The alignment is less than PAGE_SIZE.

Regardless of whether the allocations may be merged or not, the last_pos and last_offset fields will be updated to show the last page used for allocating and how much of the last page was used. If the last page was fully used, the offset is 0.
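A simplified sketch of that merge test, using the bootmem_data_t fields shown earlier, might look like the following (an illustration of the conditions above rather than the exact kernel code):

#include <linux/bootmem.h>      /* bootmem_data_t */
#include <asm/page.h>           /* PAGE_SIZE */

/* Can an allocation starting at start_pfn with this alignment be merged
 * with the tail of the previous boot-time allocation? */
static int bootmem_can_merge(bootmem_data_t *bdata, unsigned long start_pfn,
                             unsigned long align)
{
        return align < PAGE_SIZE &&               /* small alignment only  */
               bdata->last_offset != 0 &&         /* last page not full    */
               bdata->last_pos + 1 == start_pfn;  /* new block is adjacent */
}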
5.4 Freeing Memory

In contrast to the allocation functions, only two free functions are provided: free_bootmem() for UMA and free_bootmem_node() for NUMA. They both call free_bootmem_core(), with the only difference being that a pgdat is supplied with NUMA.
The core function is relatively simple in comparison to the rest of the allocator. For each full page affected by the free, the corresponding bit in the bitmap is set to 0. If it already was 0, BUG() is called to show a double-free occurred. BUG() is used when an unrecoverable error due to a kernel bug occurs. It terminates the running process and causes a kernel oops which shows a stack trace and debugging information that a developer can use to fix the bug.

An important restriction with the free functions is that only full pages may be freed. It is never recorded when a page is partially allocated, so if a page is only partially freed, the full page remains reserved. This is not as major a problem as it appears as the allocations always persist for the lifetime of the system; however, it is still an important restriction for developers during boot time.
5.5 Retiring the Boot Memory Allocator

Late in the bootstrapping process, the function start_kernel() is called, which knows it is safe to remove the boot allocator and all its associated data structures. Each architecture is required to provide a function mem_init() that is responsible for destroying the boot memory allocator and its associated structures.

Figure 5.2: Call Graph: mem_init()

The purpose of the function is quite simple. It is responsible for calculating the dimensions of low and high memory and printing out an informational message to the user, as well as performing final initialisations of the hardware if necessary. On the x86, the principal function of concern for the VM is free_pages_init().
This function first tells the boot memory allocator to retire itself by calling free_all_bootmem() for UMA architectures or free_all_bootmem_node() for NUMA. Both call the core function free_all_bootmem_core() with different parameters. The core function is simple in principle and performs the following tasks:

• For all unallocated pages known to the allocator for this node:
    Clear the PG_reserved flag in its struct page;
    Set the count to 1;
    Call __free_pages() so that the buddy allocator (discussed in the next chapter) can build its free lists.

• Free all pages used for the bitmap and give them to the buddy allocator.
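A single-node sketch of those steps is shown below; it assumes the UMA mem_map array and the bitmap fields described earlier and is a simplification rather than the exact kernel code (the real function also frees the bitmap pages and keeps accounting totals).

#include <linux/mm.h>
#include <linux/bootmem.h>
#include <asm/bitops.h>

static unsigned long __init retire_bootmem(bootmem_data_t *bdata,
                                           unsigned long total_pfns)
{
        unsigned long pfn, freed = 0;

        for (pfn = 0; pfn < total_pfns; pfn++) {
                if (!test_bit(pfn, bdata->node_bootmem_map)) {
                        struct page *page = &mem_map[pfn];

                        ClearPageReserved(page);   /* clear PG_reserved */
                        set_page_count(page, 1);   /* usage count of 1  */
                        __free_pages(page, 0);     /* hand to the buddy */
                        freed++;
                }
        }
        return freed;
}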
At this stage, the buddy allocator now has control of all the pages in low memory, which leaves only the high memory pages. After free_all_bootmem() returns, it first counts the number of reserved pages for accounting purposes. The remainder of the free_pages_init() function is responsible for the high memory pages. However, at this point, it should be clear how the global mem_map array is allocated and initialised and how the pages are given to the main allocator. The basic flow used to initialise pages in low memory in a single node system is shown in Figure 5.3.

Figure 5.3: Initialising mem_map and the Main Physical Page Allocator

Once free_all_bootmem() returns, all the pages in ZONE_NORMAL have been given to the buddy allocator. To initialise the high memory pages, free_pages_init() calls one_highpage_init() for every page between highstart_pfn and highend_pfn. one_highpage_init() simply clears the PG_reserved flag, sets the PG_highmem flag, sets the count to 1 and calls __free_pages() to release it to the buddy allocator in the same manner free_all_bootmem_core() did.
At this point, the boot memory allocator is no longer required and the buddy allocator is the main physical page allocator for the system. An interesting feature to note is that not only is the data for the boot allocator removed but also all code that was used to bootstrap the system. All initialisation functions that are required only during system start-up are marked __init, such as the following:

321 unsigned long __init free_all_bootmem (void)

All of these functions are placed together in the .init section by the linker. On the x86, the function free_initmem() walks through all pages from __init_begin to __init_end and frees up the pages to the buddy allocator. With this method, Linux can free up a considerable amount of memory that is used by bootstrapping code that is no longer required. For example, 27 pages were freed while booting the kernel running on the machine this document is composed on.
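As a hedged example of the convention (the function and table names are invented), boot-only code and data are marked so that the linker groups them into the .init section and they can be discarded by free_initmem():

#include <linux/init.h>

static int __initdata boot_lookup_table[128];   /* discarded after boot */

static int __init my_subsystem_init(void)
{
        boot_lookup_table[0] = 1;
        /* ... one-off start-up work ... */
        return 0;
}

__initcall(my_subsystem_init);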
5.6 What's New in 2.6

The boot memory allocator has not changed significantly since 2.4 and the changes are mainly optimisations and some minor NUMA-related modifications. The first optimisation is the addition of a last_success field to the bootmem_data_t struct. As the name suggests, it keeps track of the location of the last successful allocation to reduce search times. If an address is freed before last_success, it will be changed to the freed location.

The second optimisation is also related to the linear search. When searching for a free page, 2.4 tests every bit, which is expensive. 2.6 instead tests if a block of BITS_PER_LONG is all ones. If it is not, it will test each of the bits individually in that block. To help the linear search, nodes are ordered in order of their physical addresses by init_bootmem().

The last change is related to NUMA and contiguous architectures. Contiguous architectures now define their own init_bootmem() function and any architecture can optionally define its own reserve_bootmem() function.
Chapter 6
Physical Page Allocation
This chapter describes how physical pages are managed and allocated in Linux. The principal algorithm used is the Binary Buddy Allocator, devised by Knowlton [Kno65] and further described by Knuth [Knu68]. It has been shown to be extremely fast in comparison to other allocators [KB85].

This is an allocation scheme which combines a normal power-of-two allocator with free buffer coalescing [Vah96] and the basic concept behind it is quite simple. Memory is broken up into large blocks of pages where each block is a power of two number of pages. If a block of the desired size is not available, a large block is broken up in half and the two blocks are buddies to each other. One half is used for the allocation and the other is free. The blocks are continuously halved as necessary until a block of the desired size is available. When a block is later freed, the buddy is examined and the two are coalesced if it is free.

This chapter will begin with describing how Linux remembers what blocks of memory are free. After that, the methods for allocating and freeing pages will be discussed in detail. The subsequent section will cover the flags which affect the allocator behaviour and finally the problem of fragmentation and how the allocator handles it will be covered.
6.1 Managing Free Blocks

As stated, the allocator maintains blocks of free pages where each block is a power of two number of pages. The exponent for the power of two sized block is referred to as the order. An array of free_area_t structs is maintained for each order that points to a linked list of blocks of pages that are free, as indicated by Figure 6.1. Hence, the 0th element of the array will point to a list of free page blocks of size 2^0 or 1 page, the 1st element will be a list of 2^1 (2) pages, up to 2^(MAX_ORDER−1) number of pages, where MAX_ORDER is currently defined as 10. This eliminates the chance that a larger block will be split to satisfy a request where a smaller block would have sufficed. The page blocks are maintained on a linear linked list via page→list.

Figure 6.1: Free page block management

Each zone has a free_area_t struct array called free_area[MAX_ORDER]. It is declared in <linux/mm.h> as follows:
22 typedef struct free_area_struct {
23         struct list_head        free_list;
24         unsigned long           *map;
25 } free_area_t;
The fields in this struct are simply:

free_list A linked list of free page blocks;

map A bitmap representing the state of a pair of buddies.

Linux saves memory by only using one bit instead of two to represent each pair of buddies. Each time a buddy is allocated or freed, the bit representing the pair of buddies is toggled so that the bit is zero if the pair of pages are both free or both full and 1 if only one buddy is in use. To toggle the correct bit, the macro MARK_USED() in page_alloc.c is used, which is declared as follows:

164 #define MARK_USED(index, order, area) \
165         __change_bit((index) >> (1+(order)), (area)->map)

index is the index of the page within the global mem_map array. By shifting it right by 1+order bits, the bit within map representing the pair of buddies is revealed.
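A standalone worked example of that shift (not kernel code): for the buddy pair containing page index 100 at order 2, the toggled bit is 100 >> (1 + 2) = 12, and that single bit covers the 2^3 = 8 pages 96 to 103.

#include <stdio.h>

/* Bit index within free_area->map for a page of the given index/order. */
static unsigned long buddy_bit(unsigned long index, unsigned int order)
{
        return index >> (1 + order);
}

int main(void)
{
        printf("%lu\n", buddy_bit(100, 2));     /* prints 12 */
        return 0;
}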
6.2 Allocating Pages

Linux provides a quite sizable API for the allocation of page frames. All of them take a gfp_mask as a parameter, which is a set of flags that determine how the allocator will behave. The flags are discussed in Section 6.4.

The allocation API functions all use the core function __alloc_pages() but the APIs exist so that the correct node and zone will be chosen. Different users will require different zones, such as ZONE_DMA for certain device drivers or ZONE_NORMAL for disk buffers, and callers should not have to be aware of what node is being used. A full list of page allocation APIs is listed in Table 6.1.

struct page * alloc_page(unsigned int gfp_mask)
  Allocate a single page and return a struct page

struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
  Allocate 2^order number of pages and return a struct page

unsigned long get_free_page(unsigned int gfp_mask)
  Allocate a single page, zero it and return a virtual address

unsigned long __get_free_page(unsigned int gfp_mask)
  Allocate a single page and return a virtual address

unsigned long __get_free_pages(unsigned int gfp_mask, unsigned int order)
  Allocate 2^order number of pages and return a virtual address

struct page * __get_dma_pages(unsigned int gfp_mask, unsigned int order)
  Allocate 2^order number of pages from the DMA zone and return a struct page

Table 6.1: Physical Pages Allocation API
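A hedged usage sketch of this API follows; the caller is invented for illustration, error handling is minimal and the pages are released immediately with __free_pages(), which is described in Section 6.3.

#include <linux/mm.h>

static void example_allocation(void)
{
        /* Allocate 2^2 = 4 physically contiguous page frames. */
        struct page *page = alloc_pages(GFP_KERNEL, 2);

        if (page == NULL)
                return;

        /* ... use page_address(page) if the block is in low memory ... */

        __free_pages(page, 2);          /* order must match the allocation */
}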
Allocations are always for a specified order, 0 in the case where a single page is required. If a free block cannot be found of the requested order, a higher order block is split into two buddies. One is allocated and the other is placed on the free list for the lower order. Figure 6.2 shows where a 2^4 block is split and how the buddies are added to the free lists until a block for the process is available.

When the block is later freed, the buddy will be checked. If both are free, they are merged to form a higher order block and placed on the higher free list where its buddy is checked and so on. If the buddy is not free, the freed block is added to the free list at the current order. During these list manipulations, interrupts have to be disabled to prevent an interrupt handler manipulating the lists while a process has them in an inconsistent state. This is achieved by using an interrupt safe spinlock.
The second decision to make is which memory node or pg_data_t to use. Linux uses a node-local allocation policy which aims to use the memory bank associated with the CPU running the page-allocating process. Here, the function _alloc_pages() is what is important as this function is different depending on whether the kernel is built for a UMA (function in mm/page_alloc.c) or NUMA (function in mm/numa.c) machine.

Figure 6.2: Allocating physical pages

Regardless of which API is used, __alloc_pages() in mm/page_alloc.c is the heart of the allocator. This function, which is never called directly, examines the selected zone and checks if it is suitable to allocate from based on the number of available pages. If the zone is not suitable, the allocator may fall back to other zones. The order of zones to fall back on is decided at boot time by the function build_zonelists() but generally ZONE_HIGHMEM will fall back to ZONE_NORMAL and that in turn will fall back to ZONE_DMA. If the number of free pages reaches the pages_low watermark, it will wake kswapd to begin freeing up pages from zones and, if memory is extremely tight, the caller will do the work of kswapd itself.

Figure 6.3: Call Graph: alloc_pages()

Once the zone has finally been decided on, the function rmqueue() is called to allocate the block of pages or split higher level blocks if one of the appropriate size is not available.
6.3 Free Pages

The API for the freeing of pages is a lot simpler and exists to help remember the order of the block to free, as one disadvantage of a buddy allocator is that the caller has to remember the size of the original allocation. The API for freeing is listed in Table 6.2.

void __free_pages(struct page *page, unsigned int order)
  Free an order number of pages from the given page

void __free_page(struct page *page)
  Free a single page

void free_page(void *addr)
  Free a page from the given virtual address

Table 6.2: Physical Pages Free API

The principal function for freeing pages is __free_pages_ok() and it should not be called directly. Instead the function __free_pages() is provided, which performs simple checks first, as indicated in Figure 6.4.

Figure 6.4: Call Graph: __free_pages()

When a buddy is freed, Linux tries to coalesce the buddies together immediately if possible. This is not optimal as the worst case scenario will have many coalitions followed by the immediate splitting of the same blocks [Vah96].

To detect if the buddies can be merged or not, Linux checks the bit corresponding to the affected pair of buddies in free_area→map. As one buddy has just been freed by this function, it is obviously known that at least one buddy is free. If the bit in the map is 0 after toggling, we know that the other buddy must also be free because if the bit is 0, it means both buddies are either both free or both allocated. If both are free, they may be merged.
Calculating the address of the buddy is a well known concept [Knu68]. As the allocations are always in blocks of size 2^k, the address of the block, or at least its offset within zone_mem_map, will also be aligned on a 2^k boundary. The end result is that there will always be at least k number of zeros to the right of the address. To get the address of the buddy, the kth bit from the right is examined. If it is 0, then the buddy will have this bit flipped. To get this bit, Linux creates a mask which is calculated as

mask = (∼0 << k)

The mask we are interested in is

imask = 1 + ∼mask

Linux takes a shortcut in calculating this by noting that

imask = −mask = 1 + ∼mask

Once the buddy is merged, it is removed from the free list and the newly coalesced pair moves to the next higher order to see if it may also be merged.
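The following standalone snippet (not kernel code) shows the effect of that mask: since −mask equals 1 << k, XORing with it flips the kth bit and yields the buddy's offset.

#include <stdio.h>

static unsigned long buddy_index(unsigned long page_idx, unsigned int k)
{
        unsigned long mask = ~0UL << k;

        /* -mask == 1 + ~mask == 1UL << k, so the XOR flips bit k. */
        return page_idx ^ -mask;
}

int main(void)
{
        /* The order-2 buddy of the block at index 8 is 12, and vice versa. */
        printf("%lu %lu\n", buddy_index(8, 2), buddy_index(12, 2));
        return 0;
}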
6.4 Get Free Page (GFP) Flags

A persistent concept through the whole VM is the Get Free Page (GFP) flags. These flags determine how the allocator and kswapd will behave for the allocation and freeing of pages. For example, an interrupt handler may not sleep, so it will not have the __GFP_WAIT flag set as this flag indicates the caller may sleep. There are three sets of GFP flags, all defined in <linux/mm.h>.

The first of the three is the set of zone modifiers listed in Table 6.3. These flags indicate that the caller must try to allocate from a particular zone. The reader will note there is not a zone modifier for ZONE_NORMAL. This is because the zone modifier flag is used as an offset within an array and 0 implicitly means allocate from ZONE_NORMAL.

__GFP_DMA       Allocate from ZONE_DMA if possible
__GFP_HIGHMEM   Allocate from ZONE_HIGHMEM if possible
GFP_DMA         Alias for __GFP_DMA

Table 6.3: Low Level GFP Flags Affecting Zone Allocation

The next flags are action modifiers listed in Table 6.4. They change the behaviour of the VM and what the calling process may do. The low level flags on their own are too primitive to be easily used.
__GFP_WAIT Indicates that the caller is not high priority and can sleep or reschedule

__GFP_HIGH Used by a high priority or kernel process. Kernel 2.2.x used it to determine if a process could access emergency pools of memory. In 2.4.x kernels, it does not appear to be used

__GFP_IO Indicates that the caller can perform low level IO. In 2.4.x, the main effect this has is determining if try_to_free_buffers() can flush buffers or not. It is used by at least one journaled filesystem

__GFP_HIGHIO Determines that IO can be performed on pages mapped in high memory. Only used in try_to_free_buffers()

__GFP_FS Indicates if the caller can make calls to the filesystem layer. This is used when the caller is filesystem related, the buffer cache for instance, and wants to avoid recursively calling itself

Table 6.4: Low Level GFP Flags Affecting Allocator Behaviour
It is difficult to know what the correct combinations are for each instance, so a few high level combinations are defined and listed in Table 6.5. For clarity the __GFP_ is removed from the table combinations so the __GFP_HIGH flag will read as HIGH below. The flags that result from these combinations are described in Table 6.6. To help understand this, take GFP_ATOMIC as an example. It has only the __GFP_HIGH flag set. This means it is high priority, will use emergency pools (if they exist) but will not sleep, perform IO or access the filesystem. This flag would be used by an interrupt handler, for example.
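A hedged sketch contrasting the two most common combinations follows; the function names are invented and the point is only which flag is safe in which context.

#include <linux/mm.h>

/* Interrupt context: may fail, but will never sleep, perform IO or
 * touch the filesystem. */
static struct page *grab_page_in_interrupt(void)
{
        return alloc_page(GFP_ATOMIC);
}

/* Process context: may sleep while memory is freed if the zones are low. */
static struct page *grab_page_in_process(void)
{
        return alloc_page(GFP_KERNEL);
}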
Flag            Low Level Flag Combination
GFP_ATOMIC      HIGH
GFP_NOIO        HIGH | WAIT
GFP_NOHIGHIO    HIGH | WAIT | IO
GFP_NOFS        HIGH | WAIT | IO | HIGHIO
GFP_KERNEL      HIGH | WAIT | IO | HIGHIO | FS
GFP_NFS         HIGH | WAIT | IO | HIGHIO | FS
GFP_USER        WAIT | IO | HIGHIO | FS
GFP_HIGHUSER    WAIT | IO | HIGHIO | FS | HIGHMEM
GFP_KSWAPD      WAIT | IO | HIGHIO | FS

Table 6.5: Low Level GFP Flag Combinations For High Level Use
GFP_ATOMIC This flag is used whenever the caller cannot sleep and must be serviced if at all possible. Any interrupt handler that requires memory must use this flag to avoid sleeping or performing IO. Many subsystems during init will use this flag, such as buffer_init() and inode_init()

GFP_NOIO This is used by callers who are already performing an IO related function. For example, when the loopback device is trying to get a page for a buffer head, it uses this flag to make sure it will not perform some action that would result in more IO. In fact, it appears the flag was introduced specifically to avoid a deadlock in the loopback device

GFP_NOHIGHIO This is only used in one place, in alloc_bounce_page(), during the creation of a bounce buffer for IO in high memory

GFP_NOFS This is only used by the buffer cache and filesystems to make sure they do not recursively call themselves by accident

GFP_KERNEL The most liberal of the combined flags. It indicates that the caller is free to do whatever it pleases. Strictly speaking the difference between this flag and GFP_USER is that this could use emergency pools of pages but that is a no-op on 2.4.x kernels

GFP_USER Another flag of historical significance. In the 2.2.x series, an allocation was given a LOW, MEDIUM or HIGH priority. If memory was tight, a request with GFP_USER (low) would fail whereas the others would keep trying. Now it has no significance and is not treated any differently to GFP_KERNEL

GFP_HIGHUSER This flag indicates that the allocator should allocate from ZONE_HIGHMEM if possible. It is used when the page is allocated on behalf of a user process

GFP_NFS This flag is defunct. In the 2.0.x series, this flag determined what the reserved page size was. Normally 20 free pages were reserved. If this flag was set, only 5 would be reserved. Now it is not treated differently anywhere

GFP_KSWAPD More historical significance. In reality this is not treated any differently to GFP_KERNEL

Table 6.6: High Level GFP Flags Affecting Allocator Behaviour
6.4.1 Process Flags

A process may also set flags in the task_struct which affect allocator behaviour. The full list of process flags is defined in <linux/sched.h> but only the ones affecting VM behaviour are listed in Table 6.7.
PF_MEMALLOC This flags the process as a memory allocator. kswapd sets this flag and it is set for any process that is about to be killed by the Out Of Memory (OOM) killer which is discussed in Chapter 13. It tells the buddy allocator to ignore zone watermarks and assign the pages if at all possible

PF_MEMDIE This is set by the OOM killer and functions the same as the PF_MEMALLOC flag by telling the page allocator to give pages if at all possible as the process is about to die

PF_FREE_PAGES Set when the buddy allocator calls try_to_free_pages() itself to indicate that free pages should be reserved for the calling process in __free_pages_ok() instead of returning to the free lists

Table 6.7: Process Flags Affecting Allocator Behaviour
6.5 Avoiding Fragmentation

One important problem that must be addressed with any allocator is the problem of internal and external fragmentation. External fragmentation is the inability to service a request because the available memory exists only in small blocks. Internal fragmentation is defined as the wasted space where a large block had to be assigned to service a small request. In Linux, external fragmentation is not a serious problem as large requests for contiguous pages are rare and usually vmalloc() (see Chapter 7) is sufficient to service the request. The lists of free blocks ensure that large blocks do not have to be split unnecessarily.

Internal fragmentation is the single most serious failing of the binary buddy system. While fragmentation is expected to be in the region of 28% [WJNB95], it has been shown that it can be in the region of 60%, in comparison to just 1% with the first fit allocator [JW98]. It has also been shown that using variations of the buddy system will not help the situation significantly [PN77]. To address this problem, Linux uses a slab allocator [Bon94] to carve up pages into small blocks of memory for allocation [Tan01], which is discussed further in Chapter 8. With this combination of allocators, the kernel can ensure that the amount of memory wasted due to internal fragmentation is kept to a minimum.
6.6 What's New In 2.6

Allocating Pages

The first noticeable difference seems cosmetic at first. The function alloc_pages() is now a macro and defined in <linux/gfp.h> instead of a function defined in <linux/mm.h>. The new layout is still very recognisable and the main difference is a subtle but important one. In 2.4, there was specific code dedicated to selecting the correct node to allocate from based on the running CPU but 2.6 removes this distinction between NUMA and UMA architectures.

In 2.6, the function alloc_pages() calls numa_node_id() to return the logical ID of the node associated with the current running CPU. This NID is passed to _alloc_pages() which calls NODE_DATA() with the NID as a parameter. On UMA architectures, this will unconditionally result in contig_page_data being returned but NUMA architectures instead set up an array which NODE_DATA() uses the NID as an offset into. In other words, architectures are responsible for setting up a CPU ID to NUMA memory node mapping. This is effectively still a node-local allocation policy as is used in 2.4 but it is a lot more clearly defined.
Per-CPU Page Lists

The most important addition to the page allocation is the addition of the per-cpu lists, first discussed in Section 2.6.

In 2.4, a page allocation requires an interrupt safe spinlock to be held while the allocation takes place. In 2.6, pages are allocated from a struct per_cpu_pageset by buffered_rmqueue(). If the low watermark (per_cpu_pageset→low) has not been reached, the pages will be allocated from the pageset with no requirement for a spinlock to be held. Once the low watermark is reached, a large number of pages will be allocated in bulk with the interrupt safe spinlock held, added to the per-cpu list and then one returned to the caller.

Higher order allocations, which are relatively rare, still require the interrupt safe spinlock to be held and there will be no delay in the splits or coalescing. With 0 order allocations, splits will be delayed until the low watermark is reached in the per-cpu set and coalescing will be delayed until the high watermark is reached.

However, strictly speaking, this is not a lazy buddy algorithm [BL89]. While pagesets introduce a merging delay for order-0 allocations, it is a side-effect rather than an intended feature and there is no method available to drain the pagesets and merge the buddies. In other words, despite the per-cpu and new accounting code which bulks up the amount of code in mm/page_alloc.c, the core of the buddy algorithm remains the same as it was in 2.4.

The implication of this change is straightforward; the number of times the spinlock protecting the buddy lists must be acquired is reduced. Higher order allocations are relatively rare in Linux so the optimisation is for the common case. This change will be noticeable on machines with a large number of CPUs but will make little difference to single CPU machines. There are a few issues with pagesets but they are not recognised as a serious problem. The first issue is that high order allocations may fail if the pagesets hold order-0 pages that would normally be merged into higher order contiguous blocks. The second is that an order-0 allocation may fail if memory is low, the current CPU pageset is empty and other CPUs' pagesets are full, as no mechanism exists for reclaiming pages from remote pagesets. The last potential problem is that buddies of newly freed pages could exist in other pagesets, leading to possible fragmentation problems.
Freeing Pages

Two new API functions have been introduced for the freeing of pages, called free_hot_page() and free_cold_page(). Predictably, they determine if the freed pages are placed on the hot or cold lists in the per-cpu pagesets. However, while free_cold_page() is exported and available for use, it is actually never called.

Order-0 page frees from __free_pages() and frees resulting from page cache releases by __page_cache_release() are placed on the hot list whereas higher order allocations are freed immediately with __free_pages_ok(). Order-0 frees are usually related to userspace and are the most common type of allocation and free. By keeping them local to the CPU, lock contention will be reduced as most allocations will also be of order-0.

Eventually, lists of pages must be passed to free_pages_bulk() or the pageset lists would hold all free pages. This free_pages_bulk() function takes a list of page block allocations, the order of each block and the count number of blocks to free from the list.

There are two principal cases where this is used. The first is higher order frees passed to __free_pages_ok(). In this case, the page block is placed on a linked list of the specified order with a count of 1. The second case is where the high watermark is reached in the pageset for the running CPU. In this case, the pageset is passed with an order of 0 and a count of pageset→batch.

Once the core function __free_pages_bulk() is reached, the mechanism for freeing pages to the buddy lists is very similar to 2.4.
GFP Flags

There are still only three zones, so the zone modifiers remain the same, but three new GFP flags have been added that affect how hard the VM will work, or not work, to satisfy a request. The flags are:

__GFP_NOFAIL This flag is used by a caller to indicate that the allocation should never fail and the allocator should keep trying to allocate indefinitely;

__GFP_REPEAT This flag is used by a caller to indicate that the request should try to repeat the allocation if it fails. In the current implementation, it behaves the same as __GFP_NOFAIL but later the decision might be made to fail after a while;

__GFP_NORETRY This flag is almost the opposite of __GFP_NOFAIL. It indicates that if the allocation fails it should just return immediately.

At the time of writing, they are not heavily used but they have just been introduced and are likely to be used more over time. The __GFP_REPEAT flag in particular is likely to be heavily used as blocks of code which implement this flag's behaviour exist throughout the kernel.
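A brief hedged sketch of how the new modifiers combine with the existing high level flags follows; whether a particular caller should ever use __GFP_NOFAIL is a separate policy question and the function names are invented.

#include <linux/slab.h>
#include <linux/gfp.h>

/* Keep retrying indefinitely rather than returning NULL. */
static void *must_succeed(size_t size)
{
        return kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
}

/* Give up immediately if the request cannot be satisfied. */
static void *opportunistic(size_t size)
{
        return kmalloc(size, GFP_KERNEL | __GFP_NORETRY);
}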
The next GFP flag that has been introduced is an allocation modifier called __GFP_COLD which is used to ensure that cold pages are allocated from the per-cpu lists. From the perspective of the VM, the only user of this flag is the function page_cache_alloc_cold() which is mainly used during IO readahead. Usually page allocations will be taken from the hot pages list.
The last new flag is __GFP_NO_GROW. This is an internal flag used only by the slab allocator (discussed in Chapter 8) which aliases the flag to SLAB_NO_GROW. It is used to indicate when new slabs should never be allocated for a particular cache. In reality, the GFP flag has just been introduced to complement the old SLAB_NO_GROW flag which is currently unused in the main kernel.
Chapter 7
Non-Contiguous Memory Allocation
It is preferable when dealing with large amounts of memory to use physically contiguous pages in memory both for cache related and memory access latency reasons. Unfortunately, due to external fragmentation problems with the buddy allocator, this is not always possible. Linux provides a mechanism via vmalloc() where non-contiguous physical memory can be used that is contiguous in virtual memory.

An area is reserved in the virtual address space between VMALLOC_START and VMALLOC_END. The location of VMALLOC_START depends on the amount of available physical memory but the region will always be at least VMALLOC_RESERVE in size, which on the x86 is 128MiB. The exact size of the region is discussed in Section 4.1.

The page tables in this region are adjusted as necessary to point to physical pages which are allocated with the normal physical page allocator. This means that allocation must be a multiple of the hardware page size. As allocations require altering the kernel page tables, there is a limitation on how much memory can be mapped with vmalloc() as only the virtual address space between VMALLOC_START and VMALLOC_END is available. As a result, it is used sparingly in the core kernel. In 2.4.22, it is only used for storing the swap map information (see Chapter 11) and for loading kernel modules into memory.

This small chapter begins with a description of how the kernel tracks which areas in the vmalloc address space are used and how regions are allocated and freed.
7.1 Describing Virtual Memory Areas

The vmalloc address space is managed with a resource map allocator [Vah96]. The struct vm_struct is responsible for storing the base,size pairs. It is defined in <linux/vmalloc.h> as:

14 struct vm_struct {
15         unsigned long flags;
16         void * addr;
17         unsigned long size;
18         struct vm_struct * next;
19 };
A fully-fledged VMA could have been used but it contains extra information that does not apply to vmalloc areas and would be wasteful. Here is a brief description of the fields in this small struct.

flags These are set either to VM_ALLOC, in the case of use with vmalloc(), or VM_IOREMAP when ioremap is used to map high memory into the kernel virtual address space;

addr This is the starting address of the memory block;

size This is, predictably enough, the size in bytes;

next This is a pointer to the next vm_struct. They are ordered by address and the list is protected by the vmlist_lock lock.

As is clear, the areas are linked together via the next field and are ordered by address for simple searches. Each area is separated by at least one page to protect against overruns. This is illustrated by the gaps in Figure 7.1.

Figure 7.1: vmalloc Address Space

When the kernel wishes to allocate a new area, the vm_struct list is searched linearly by the function get_vm_area(). Space for the struct is allocated with kmalloc(). When the virtual area is used for remapping an area for IO (commonly referred to as ioremapping), this function will be called directly to map the requested area.
7.2 Allocating A Non-Contiguous Area

The functions vmalloc(), vmalloc_dma() and vmalloc_32() are provided to allocate a memory area that is contiguous in virtual address space. They all take a single parameter size which is rounded up to the next page alignment. They all return a linear address for the new allocated area.

As is clear from the call graph shown in Figure 7.2, there are two steps to allocating the area. The first step taken by get_vm_area() is to find a region large enough to store the request. It searches through a linear linked list of vm_structs and returns a new struct describing the allocated region.

The second step is to allocate the necessary PGD entries with vmalloc_area_pages(), PMD entries with alloc_area_pmd() and PTE entries with alloc_area_pte() before finally allocating the page with alloc_page().
Figure 7.2: Call Graph: vmalloc()

void * vmalloc(unsigned long size)
  Allocate a number of pages in vmalloc space that satisfy the requested size

void * vmalloc_dma(unsigned long size)
  Allocate a number of pages from ZONE_DMA

void * vmalloc_32(unsigned long size)
  Allocate memory that is suitable for 32 bit addressing. This ensures that the physical page frames are in ZONE_NORMAL which 32 bit devices will require

Table 7.1: Non-Contiguous Memory Allocation API
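A hedged usage sketch of the API in Table 7.1 follows; the buffer name is invented and the allocation is virtually, not physically, contiguous.

#include <linux/vmalloc.h>

static void *big_table;

static int table_create(unsigned long nr_bytes)
{
        big_table = vmalloc(nr_bytes);  /* rounded up to whole pages */
        if (big_table == NULL)
                return -1;
        return 0;
}

static void table_destroy(void)
{
        vfree(big_table);
        big_table = NULL;
}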
The page table updated by vmalloc() is not that of the current process but the reference page table stored at init_mm→pgd. This means that a process accessing the vmalloc area will cause a page fault exception as its page tables are not pointing to the correct area. There is a special case in the page fault handling code which knows that the fault occurred in the vmalloc area and updates the current process page tables using information from the master page table. How the use of vmalloc() relates to the buddy allocator and page faulting is illustrated in Figure 7.3.

Figure 7.3: Relationship between vmalloc(), alloc_page() and Page Faulting
7.3 Freeing A Non-Contiguous Area

The function vfree() is responsible for freeing a virtual area. It linearly searches the list of vm_structs looking for the desired region and then calls vmfree_area_pages() on the region of memory to be freed.

vmfree_area_pages() is the exact opposite of vmalloc_area_pages(). It walks the page tables freeing up the page table entries and associated pages for the region.

void vfree(void *addr)
  Free a region of memory allocated with vmalloc(), vmalloc_dma() or vmalloc_32()

Table 7.2: Non-Contiguous Memory Free API
Figure 7.4: Call Graph: vfree()
7.4 What's New in 2.6

Non-contiguous memory allocation remains essentially the same in 2.6. The main difference is a slightly different internal API which affects when the pages are allocated. In 2.4, vmalloc_area_pages() is responsible for beginning a page table walk and then allocating pages when the PTE is reached in the function alloc_area_pte(). In 2.6, all the pages are allocated in advance by __vmalloc() and placed in an array which is passed to map_vm_area() for insertion into the kernel page tables.

The get_vm_area() API has changed very slightly. When called, it behaves the same as previously as it searches the entire vmalloc virtual address space for a free area. However, a caller can search just a subset of the vmalloc address space by calling __get_vm_area() directly and specifying the range. This is only used by the ARM architecture when loading modules.

The last significant change is the introduction of a new interface vmap() for the insertion of an array of pages in the vmalloc address space, which is only used by the sound subsystem core. This interface was backported to 2.4.22 but it is totally unused. It is either the result of an accidental backport or was merged to ease the application of vendor-specific patches that require vmap().
Chapter 8
Slab Allocator
In this chapter, the general-purpose allocator is described. It is a slab allocator which is very similar in many respects to the general kernel allocator used in Solaris [MM01]. Linux's implementation is heavily based on the first slab allocator paper by Bonwick [Bon94] with many improvements that bear a close resemblance to those described in his later paper [BA01]. We will begin with a quick overview of the allocator followed by a description of the different structures used before giving an in-depth tour of each task the allocator is responsible for.

The basic idea behind the slab allocator is to have caches of commonly used objects kept in an initialised state available for use by the kernel. Without an object based allocator, the kernel will spend much of its time allocating, initialising and freeing the same object. The slab allocator aims to cache the freed object so that the basic structure is preserved between uses [Bon94].

The slab allocator consists of a variable number of caches that are linked together on a doubly linked circular list called a cache chain. A cache, in the context of the slab allocator, is a manager for a number of objects of a particular type, like the mm_struct or fs_cache cache, and is managed by a struct kmem_cache_s discussed in detail later. The caches are linked via the next field in the cache struct.

Each cache maintains blocks of contiguous pages in memory called slabs which are carved up into small chunks for the data structures and objects the cache manages. The relationship between these different structures is illustrated in Figure 8.1.
The relationship between these dierent structures is illustrated in Figure 8.1.
The slab allocator has three principle aims:
•
The allocation of small blocks of memory to help eliminate internal fragmentation that would be otherwise caused by the buddy system;
•
The caching of commonly used objects so that the system does not waste
time allocating, initialising and destroying objects.
Benchmarks on Solaris
showed excellent speed improvements for allocations with the slab allocator in
use [Bon94];
•
The better utilisation of hardware cache by aligning objects to the L1 or L2
caches.
115
Figure 8.1: Layout of the Slab Allocator

To help eliminate internal fragmentation normally caused by a binary buddy allocator, two sets of caches of small memory buffers ranging from 2^5 (32) bytes to 2^17 (131072) bytes are maintained. One cache set is suitable for use with DMA devices. These caches are called size-N and size-N(DMA) where N is the size of the allocation, and a function kmalloc() (see Section 8.4.1) is provided for allocating them. With this, the single greatest problem with the low level page allocator is addressed. The sizes caches are discussed in further detail in Section 8.4.
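A hedged example of the sizes caches in use follows; struct request_ctx is invented and a roughly 90 byte request would be served from the next largest sizes cache, so a little space is lost to internal fragmentation.

#include <linux/slab.h>

struct request_ctx {                    /* invented structure */
        char data[90];
};

static struct request_ctx *ctx_alloc(void)
{
        return kmalloc(sizeof(struct request_ctx), GFP_KERNEL);
}

static void ctx_free(struct request_ctx *ctx)
{
        kfree(ctx);
}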
The second task of the slab allocator is to maintain caches of commonly used objects. For many structures used in the kernel, the time needed to initialise an object is comparable to, or exceeds, the cost of allocating space for it. When a new slab is created, a number of objects are packed into it and initialised using a constructor if available. When an object is freed, it is left in its initialised state so that object allocation will be quick.

The final task of the slab allocator is hardware cache utilisation. If there is space left over after objects are packed into a slab, the remaining space is used to color the slab. Slab coloring is a scheme which attempts to have objects in different slabs use different lines in the cache. By placing objects at a different starting offset within the slab, it is likely that objects will use different lines in the CPU cache, helping ensure that objects from the same slab cache will be unlikely to flush each other.
With this scheme, space that would otherwise be wasted fulfils a new function. Figure 8.2 shows how a page allocated from the buddy allocator is used to store objects that use coloring to align the objects to the L1 CPU cache.

Figure 8.2: Slab page containing Objects Aligned to L1 CPU Cache

Linux does not attempt to color page allocations based on their physical address [Kes91], or order where objects are placed, such as those described for data [GAV95] or code segments [HK97], but the scheme used does help improve cache line usage. Cache colouring is further discussed in Section 8.1.5. On an SMP system, a further step is taken to help cache utilisation where each cache has a small array of objects reserved for each CPU. This is discussed further in Section 8.5.

The slab allocator provides the additional option of slab debugging if the option is set at compile time with CONFIG_SLAB_DEBUG. Two debugging features are provided, called red zoning and object poisoning. With red zoning, a marker is placed at either end of the object. If this mark is disturbed, the allocator knows the object where a buffer overflow occurred and reports it. Poisoning an object will fill it with a predefined bit pattern (defined as 0x5A in mm/slab.c) at slab creation and after a free. At allocation, this pattern is examined and, if it is changed, the allocator knows that the object was used before it was allocated and flags it.

The small, but powerful, API which the allocator exports is listed in Table 8.1.
kmem_cache_t * kmem_cache_create(const char *name, size_t size, size_t offset, unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long), void (*dtor)(void*, kmem_cache_t *, unsigned long))
  Creates a new cache and adds it to the cache chain

int kmem_cache_reap(int gfp_mask)
  Scans at most REAP_SCANLEN caches and selects one for reaping all per-cpu objects and free slabs from. Called when memory is tight

int kmem_cache_shrink(kmem_cache_t *cachep)
  This function will delete all per-cpu objects associated with a cache and delete all slabs in the slabs_free list. It returns the number of pages freed

void * kmem_cache_alloc(kmem_cache_t *cachep, int flags)
  Allocate a single object from the cache and return it to the caller

void kmem_cache_free(kmem_cache_t *cachep, void *objp)
  Free an object and return it to the cache

void * kmalloc(size_t size, int flags)
  Allocate a block of memory from one of the sizes caches

void kfree(const void *objp)
  Free a block of memory allocated with kmalloc

int kmem_cache_destroy(kmem_cache_t * cachep)
  Destroys all objects in all slabs and frees up all associated memory before removing the cache from the chain

Table 8.1: Slab Allocator API for caches
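A hedged usage sketch of this API follows; struct my_object, the cache name and the constructor are invented for illustration and a real kernel user would propagate errors properly.

#include <linux/slab.h>

struct my_object {                      /* invented object type */
        int state;
};

static kmem_cache_t *my_cachep;

/* Constructor: run once per object when a new slab is populated. */
static void my_ctor(void *obj, kmem_cache_t *cachep, unsigned long flags)
{
        ((struct my_object *)obj)->state = 0;
}

static int my_cache_init(void)
{
        my_cachep = kmem_cache_create("my_object_cache",
                                      sizeof(struct my_object), 0,
                                      SLAB_HWCACHE_ALIGN, my_ctor, NULL);
        return my_cachep != NULL ? 0 : -1;
}

static void my_cache_example(void)
{
        struct my_object *obj = kmem_cache_alloc(my_cachep, GFP_KERNEL);

        if (obj != NULL) {
                /* obj arrives already initialised by my_ctor(). */
                kmem_cache_free(my_cachep, obj);
        }
}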
8.1 Caches

One cache exists for each type of object that is to be cached. For a full list of caches available on a running system, run cat /proc/slabinfo. This file gives some basic information on the caches. An excerpt from the output of this file looks like the following:
slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    248    5    5    1 :  252  126
urb_priv               0      0     64    0    0    1 :  252  126
tcp_bind_bucket       15    226     32    2    2    1 :  252  126
inode_cache         5714   5992    512  856  856    1 :  124   62
dentry_cache        5160   5160    128  172  172    1 :  252  126
mm_struct            240    240    160   10   10    1 :  252  126
vm_area_struct      3911   4480     96  112  112    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64              432   1357     64   23   23    1 :  252  126
size-32(DMA)          17    113     32    1    1    1 :  252  126
size-32              850   2712     32   24   24    1 :  252  126

Each of the column fields corresponds to a field in the struct kmem_cache_s structure. The columns listed in the excerpt above are:
cache-name A human readable name such as tcp_bind_bucket;

num-active-objs Number of objects that are in use;

total-objs How many objects are available in total including unused;

obj-size The size of each object, typically quite small;

num-active-slabs Number of slabs containing objects that are active;

total-slabs How many slabs in total exist;

num-pages-per-slab The pages required to create one slab, typically 1.
If SMP is enabled like in the example excerpt, two more columns will be displayed after a colon. They refer to the per CPU cache described in Section 8.5. The columns are:

limit This is the number of free objects the pool can have before half of it is given to the global free pool;

batchcount The number of objects allocated for the processor in a block when no objects are free.
To speed allocation and freeing of objects and slabs, they are arranged into three lists: slabs_full, slabs_partial and slabs_free. slabs_full has all its objects in use. slabs_partial has free objects in it and so is a prime candidate for allocation of objects. slabs_free has no allocated objects and so is a prime candidate for slab destruction.
8.1.1 Cache Descriptor
All information describing a cache is stored in a struct kmem_cache_s declared in mm/slab.c. This is an extremely large struct and so will be described in parts.
190 struct kmem_cache_s {
193     struct list_head        slabs_full;
194     struct list_head        slabs_partial;
195     struct list_head        slabs_free;
196     unsigned int            objsize;
197     unsigned int            flags;
198     unsigned int            num;
199     spinlock_t              spinlock;
200 #ifdef CONFIG_SMP
201     unsigned int            batchcount;
202 #endif
203
Most of these fields are of interest when allocating or freeing objects.

slabs_* These are the three lists where the slabs are stored as described in the previous section;

objsize This is the size of each object packed into the slab;

flags These flags determine how parts of the allocator will behave when dealing with the cache. See Section 8.1.2;

num This is the number of objects contained in each slab;

spinlock A spinlock protecting the structure from concurrent accesses;

batchcount This is the number of objects that will be allocated in batch for the per-cpu caches as described in the previous section.
206     unsigned int            gfporder;
209     unsigned int            gfpflags;
210
211     size_t                  colour;
212     unsigned int            colour_off;
213     unsigned int            colour_next;
214     kmem_cache_t            *slabp_cache;
215     unsigned int            growing;
216     unsigned int            dflags;
217
219     void (*ctor)(void *, kmem_cache_t *, unsigned long);
222     void (*dtor)(void *, kmem_cache_t *, unsigned long);
223
224
225     unsigned long           failures;
This block deals with fields of interest when allocating or freeing slabs from the cache.

gfporder This indicates the size of the slab in pages. Each slab consumes 2^gfporder pages as these are the allocation sizes the buddy allocator provides;

gfpflags The GFP flags used when calling the buddy allocator to allocate pages are stored here. See Section 6.4 for a full list;

colour Each slab stores objects in different cache lines if possible. Cache colouring will be further discussed in Section 8.1.5;

colour_off This is the byte alignment to keep slabs at. For example, slabs for the size-X caches are aligned on the L1 cache;

colour_next This is the next colour line to use. This value wraps back to 0 when it reaches colour;

growing This flag is set to indicate if the cache is growing or not. If it is, it is much less likely this cache will be selected to reap free slabs under memory pressure;

dflags These are the dynamic flags which change during the cache lifetime. See Section 8.1.3;

ctor A complex object has the option of providing a constructor function to be called to initialise each new object. This is a pointer to that function and may be NULL;

dtor This is the complementing object destructor and may be NULL;

failures This field is not used anywhere in the code other than being initialised to 0.
227     char                    name[CACHE_NAMELEN];
228     struct list_head        next;

These are set during cache creation:

name This is the human readable name of the cache;

next This is the next cache on the cache chain.
229 #ifdef CONFIG_SMP
231     cpucache_t              *cpudata[NR_CPUS];
232 #endif

cpudata This is the per-cpu data and is discussed further in Section 8.5.

233 #if STATS
234     unsigned long           num_active;
235     unsigned long           num_allocations;
236     unsigned long           high_mark;
237     unsigned long           grown;
238     unsigned long           reaped;
239     unsigned long           errors;
240 #ifdef CONFIG_SMP
241     atomic_t                allochit;
242     atomic_t                allocmiss;
243     atomic_t                freehit;
244     atomic_t                freemiss;
245 #endif
246 #endif
247 };
These figures are only available if the CONFIG_SLAB_DEBUG option is set during compile time. They are all beancounters and not of general interest. The statistics for /proc/slabinfo are calculated when the proc entry is read by another process by examining every slab used by each cache rather than relying on these fields to be available.
num_active The current number of active objects in the cache is stored here;

num_allocations A running total of the number of objects that have been allocated on this cache is stored in this field;

high_mark This is the highest value num_active has had to date;

grown This is the number of times kmem_cache_grow() has been called;

reaped The number of times this cache has been reaped is kept here;

errors This field is never used;

allochit This is the total number of times an allocation has used the per-cpu cache;

allocmiss To complement allochit, this is the number of times an allocation has missed the per-cpu cache;

freehit This is the number of times a free was placed on a per-cpu cache;

freemiss This is the number of times an object was freed and placed on the global pool.
8.1.2 Cache Static Flags
A number of flags are set at cache creation time that remain the same for the lifetime of the cache. They affect how the slab is structured and how objects are stored within it. All the flags are stored in a bitmask in the flags field of the cache descriptor. The full list of possible flags that may be used are declared in <linux/slab.h>.
There are three principal sets. The first set is internal flags which are set only by the slab allocator and are listed in Table 8.2. The only relevant flag in the set is the CFLGS_OFF_SLAB flag which determines where the slab descriptor is stored.
Flag            Description
CFLGS_OFF_SLAB  Indicates that the slab managers for this cache are kept
                off-slab. This is discussed further in Section 8.2.1
CFLGS_OPTIMIZE  This flag is only ever set and never used

Table 8.2: Internal cache static flags
The second set are set by the cache creator and they determine how the allocator
treats the slab and how objects are stored. They are listed in Table 8.3.
Flag                      Description
SLAB_HWCACHE_ALIGN        Align the objects to the L1 CPU cache
SLAB_MUST_HWCACHE_ALIGN   Force alignment to the L1 CPU cache even if it is
                          very wasteful or slab debugging is enabled
SLAB_NO_REAP              Never reap slabs in this cache
SLAB_CACHE_DMA            Allocate slabs with memory from ZONE_DMA

Table 8.3: Cache static flags set by caller
The last flags are only available if the compile option CONFIG_SLAB_DEBUG is set and are listed in Table 8.4. They determine what additional checks will be made to slabs and objects and are primarily of interest only when new caches are being developed.

To prevent callers using the wrong flags, a CREATE_MASK is defined in mm/slab.c consisting of all the allowable flags. When a cache is being created, the requested flags are compared against the CREATE_MASK and reported as a bug if invalid flags are used.
Flag                  Description
SLAB_DEBUG_FREE       Perform expensive checks on free
SLAB_DEBUG_INITIAL    On free, call the constructor as a verifier to ensure
                      the object is still initialised correctly
SLAB_RED_ZONE         This places a marker at either end of objects to trap
                      overflows
SLAB_POISON           Poison objects with a known pattern for trapping
                      changes made to objects not allocated or initialised

Table 8.4: Cache static debug flags

8.1.3 Cache Dynamic Flags

The dflags field has only one flag, DFLGS_GROWN, but it is important. The flag is set during kmem_cache_grow() so that kmem_cache_reap() will be unlikely to choose the cache for reaping. When the function does find a cache with this flag set, it skips the cache and removes the flag.
8.1.4 Cache Allocation Flags
These flags correspond to the GFP page flag options for allocating pages for slabs. Callers sometimes call with either SLAB_* or GFP_* flags, but they really should use only SLAB_* flags. They correspond directly to the flags described in Section 6.4 so will not be discussed in detail here. It is presumed these flags exist for clarity and in case the slab allocator needed to behave differently in response to a particular flag but in reality, there is no difference.
Flag              Description
SLAB_ATOMIC       Equivalent to GFP_ATOMIC
SLAB_DMA          Equivalent to GFP_DMA
SLAB_KERNEL       Equivalent to GFP_KERNEL
SLAB_NFS          Equivalent to GFP_NFS
SLAB_NOFS         Equivalent to GFP_NOFS
SLAB_NOHIGHIO     Equivalent to GFP_NOHIGHIO
SLAB_NOIO         Equivalent to GFP_NOIO
SLAB_USER         Equivalent to GFP_USER

Table 8.5: Cache Allocation Flags
A very small number of flags may be passed to constructor and destructor functions which are listed in Table 8.6.
Flag                    Description
SLAB_CTOR_CONSTRUCTOR   Set if the function is being called as a constructor
                        for caches which use the same function as a
                        constructor and a destructor
SLAB_CTOR_ATOMIC        Indicates that the constructor may not sleep
SLAB_CTOR_VERIFY        Indicates that the constructor should just verify
                        the object is initialised correctly

Table 8.6: Cache Constructor Flags

8.1.5 Cache Colouring

To utilise hardware cache better, the slab allocator will offset objects in different slabs by different amounts depending on the amount of space left over in the slab. The offset is in units of BYTES_PER_WORD unless SLAB_HWCACHE_ALIGN is set, in which case it is aligned to blocks of L1_CACHE_BYTES for alignment to the L1 hardware cache.
During cache creation, it is calculated how many objects can fit on a slab (see Section 8.2.7) and how many bytes would be wasted. Based on wastage, two figures are calculated for the cache descriptor:

colour This is the number of different offsets that can be used;

colour_off This is the multiple to offset each object by in the slab.
With the objects offset, they will use different lines on the associative hardware cache. Therefore, objects from slabs are less likely to overwrite each other in memory.

The result of this is best explained by an example. Let us say that s_mem (the address of the first object) on the slab is 0 for convenience, that 100 bytes are wasted on the slab and alignment is to be at 32 bytes to the L1 Hardware Cache on a Pentium II.

In this scenario, the first slab created will have its objects start at 0. The second will start at 32, the third at 64, the fourth at 96 and the fifth will start back at 0. With this, objects from each of the slabs will not hit the same hardware cache line on the CPU. The value of colour is 3 and colour_off is 32.
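The arithmetic behind those two figures is straightforward. The following user-space sketch derives them for the example above; the function name cache_colour_calc() and the assumption that colour_off simply equals the alignment are illustrative, not the kernel code itself.

#include <stdio.h>
#include <stddef.h>

/* Derive the colouring figures from the wasted bytes and the alignment. */
static void cache_colour_calc(size_t wastage, size_t align,
                              unsigned int *colour, unsigned int *colour_off)
{
        *colour_off = align;          /* offset multiple, e.g. L1_CACHE_BYTES */
        *colour = wastage / align;    /* number of different offsets available */
}

int main(void)
{
        unsigned int colour, colour_off;

        /* 100 wasted bytes, 32 byte L1 cache lines as in the Pentium II example */
        cache_colour_calc(100, 32, &colour, &colour_off);

        /* Prints colour=3 colour_off=32: successive slabs start their objects
         * at offsets 0, 32, 64, 96 and then wrap back to 0. */
        printf("colour=%u colour_off=%u\n", colour, colour_off);
        return 0;
}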
8.1.6 Cache Creation
The function kmem_cache_create() is responsible for creating new caches and adding them to the cache chain. The tasks that are taken to create a cache are:

• Perform basic sanity checks for bad usage;

• Perform debugging checks if CONFIG_SLAB_DEBUG is set;

• Allocate a kmem_cache_t from the cache_cache slab cache;

• Align the object size to the word size;

• Calculate how many objects will fit on a slab;

• Align the object size to the hardware cache;

• Calculate colour offsets;

• Initialise remaining fields in cache descriptor;

• Add the new cache to the cache chain.
Figure 8.3 shows the call graph relevant to the creation of a cache; each function
is fully described in the Code Commentary.
Figure 8.3: Call Graph: kmem_cache_create()
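As a rough illustration of how a user of this API ties the pieces together, a kernel module might create and later destroy a cache of its own objects along the following lines. This is a minimal sketch against the 2.4 interface in Table 8.1; the struct foo type, the cache name and the flag choice are made up for the example.

#include <linux/slab.h>
#include <linux/errno.h>

struct foo {                            /* illustrative object type */
        int  id;
        char name[32];
};

static kmem_cache_t *foo_cachep;

static int foo_cache_setup(void)
{
        /* Create a cache of struct foo objects aligned to the L1 CPU cache.
         * No constructor or destructor is supplied so objects arrive
         * uninitialised. */
        foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
                                       0, SLAB_HWCACHE_ALIGN, NULL, NULL);
        if (!foo_cachep)
                return -ENOMEM;
        return 0;
}

static void foo_cache_teardown(void)
{
        /* Destroying the cache frees all slabs and removes it from the cache
         * chain.  All objects must have been freed back with
         * kmem_cache_free() beforehand. */
        kmem_cache_destroy(foo_cachep);
}

Objects would then be allocated with kmem_cache_alloc(foo_cachep, GFP_KERNEL) and returned with kmem_cache_free().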
8.1.7 Cache Reaping
When a slab is freed, it is placed on the slabs_free list for future use. Caches do not automatically shrink themselves so when kswapd notices that memory is tight, it calls kmem_cache_reap() to free some memory. This function is responsible for selecting a cache that will be required to shrink its memory usage. It is worth noting that cache reaping does not take into account what memory node or zone is under pressure. This means that with a NUMA or high memory machine, it is possible the kernel will spend a lot of time freeing memory from regions that are under no memory pressure but this is not a problem for architectures like the x86 which have only one bank of memory.
The call graph in Figure 8.4 is deceptively simple as the task of selecting the proper cache to reap is quite long. In the event that there are numerous caches in the system, only REAP_SCANLEN (currently defined as 10) caches are examined in each call. The last cache to be scanned is stored in the variable clock_searchp so as not to examine the same caches repeatedly. For each scanned cache, the reaper does the following:
• Check flags for SLAB_NO_REAP and skip if set;

• If the cache is growing, skip it;

• If the cache has grown recently or is currently growing, DFLGS_GROWN will be set. If this flag is set, the slab is skipped but the flag is cleared so it will be a reap candidate the next time;

• Count the number of free slabs in slabs_free and calculate how many pages that would free in the variable pages;

• If the cache has constructors or large slabs, adjust pages to make it less likely for the cache to be selected;

• If the number of pages that would be freed exceeds REAP_PERFECT, free half of the slabs in slabs_free;

• Otherwise scan the rest of the caches and select the one that would free the most pages for freeing half of its slabs in slabs_free.

Figure 8.4: Call Graph: kmem_cache_reap()
8.1.8 Cache Shrinking
When a cache is selected to shrink itself, the steps it takes are simple and brutal:

• Delete all objects in the per CPU caches;

• Delete all slabs from slabs_free unless the growing flag gets set.

Linux is nothing, if not subtle.
Two varieties of shrink functions are provided with confusingly similar names. kmem_cache_shrink() removes all slabs from slabs_free and returns the number of pages freed as a result. This is the principal function exported for use by the slab allocator users.

The second function __kmem_cache_shrink() frees all slabs from slabs_free and then verifies that slabs_partial and slabs_full are empty. This is for internal use only and is important during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.
Figure 8.5: Call Graph: kmem_cache_shrink()

Figure 8.6: Call Graph: __kmem_cache_shrink()
8.1.9 Cache Destroying
When a module is unloaded, it is responsible for destroying any cache with the function kmem_cache_destroy(). It is important that the cache is properly destroyed as two caches of the same human-readable name are not allowed to exist. Core kernel code often does not bother to destroy its caches as their existence persists for the life of the system. The steps taken to destroy a cache are:

• Delete the cache from the cache chain;

• Shrink the cache to delete all slabs;

• Free any per CPU caches (kfree());

• Delete the cache descriptor from the cache_cache.
Figure 8.7: Call Graph: kmem_cache_destroy()
8.2 Slabs
This section will describe how a slab is structured and managed. The struct which
describes it is much simpler than the cache descriptor, but how the slab is arranged
is considerably more complex. It is declared as follows:
typedef struct slab_s {
        struct list_head        list;
        unsigned long           colouroff;
        void                    *s_mem;
        unsigned int            inuse;
        kmem_bufctl_t           free;
} slab_t;
The fields in this simple struct are as follows:

list This is the linked list the slab belongs to. This will be one of slabs_full, slabs_partial or slabs_free from the cache manager;

colouroff This is the colour offset from the base address of the first object within the slab. The address of the first object is s_mem + colouroff;

s_mem This gives the starting address of the first object within the slab;

inuse This gives the number of active objects in the slab;

free This is the index of the next free object. The array of bufctls used for storing the locations of free objects is kept immediately after this struct. See Section 8.2.3 for further details.
The reader will note that given the slab manager or an object within the slab, there does not appear to be an obvious way to determine what slab or cache they belong to. This is addressed by using the list field in the struct page that makes up the cache. SET_PAGE_CACHE() and SET_PAGE_SLAB() use the next and prev fields on the page→list to track what cache and slab an object belongs to. To get the descriptors from the page, the macros GET_PAGE_CACHE() and GET_PAGE_SLAB() are available. This set of relationships is illustrated in Figure 8.8.
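In 2.4, these four macros amount to little more than storing and casting the two pointers through the page→list fields, roughly along the following lines (quoted from memory of mm/slab.c, so treat the exact form as indicative rather than authoritative):

#define SET_PAGE_CACHE(pg,x)  ((pg)->list.next = (struct list_head *)(x))
#define GET_PAGE_CACHE(pg)    ((kmem_cache_t *)(pg)->list.next)
#define SET_PAGE_SLAB(pg,x)   ((pg)->list.prev = (struct list_head *)(x))
#define GET_PAGE_SLAB(pg)     ((slab_t *)(pg)->list.prev)

Because a page backing a slab is never on any other list while the slab exists, the next and prev pointers are free to be reused in this way.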
Figure 8.8: Page to Cache and Slab Relationship

The last issue is where the slab management struct is kept. Slab managers are kept either on-slab or off-slab (CFLGS_OFF_SLAB set in the static flags). Where they are placed is determined by the size of the objects during cache creation. It is important to note that in Figure 8.8, the struct slab_t could be stored at the beginning of the page frame although the figure implies the struct slab_t is separate from the page frame.
8.2.1 Storing the Slab Descriptor
If the objects are larger than a threshold (512 bytes on x86), CFLGS_OFF_SLAB is set in the cache flags and the slab descriptor is kept off-slab in one of the sizes caches (see Section 8.4). The selected sizes cache is large enough to contain the struct slab_t and kmem_cache_slabmgmt() allocates from it as necessary. This limits the number of objects that can be stored on the slab because there is limited space for the bufctls, but that is unimportant as the objects are large and so there should not be many stored in a single slab.

Alternatively, the slab manager is reserved at the beginning of the slab. When stored on-slab, enough space is kept at the beginning of the slab to store both the slab_t and the kmem_bufctl_t which is an array of unsigned integers. The array is responsible for tracking the index of the next free object that is available for use which is discussed further in Section 8.2.3. The actual objects are stored after the kmem_bufctl_t array.

Figure 8.9 should help clarify what a slab with the descriptor on-slab looks like and Figure 8.10 illustrates how a cache uses a sizes cache to store the slab descriptor when the descriptor is kept off-slab.
Figure 8.9: Slab With Descriptor On-Slab
8.2.2 Slab Creation
At this point, we have seen how the cache is created, but on creation, it is an empty cache with empty lists for its slabs_full, slabs_partial and slabs_free. New slabs are allocated to a cache by calling the function kmem_cache_grow(). This is frequently called cache growing and occurs when no objects are left in the slabs_partial list and there are no slabs in slabs_free. The tasks it fulfils are:

• Perform basic sanity checks to guard against bad usage;

• Calculate colour offset for objects in this slab;

• Allocate memory for the slab and acquire a slab descriptor;

• Link the pages used for the slab to the slab and cache descriptors described in Section 8.2;

• Initialise objects in the slab;

• Add the slab to the cache.
8.2.3 Tracking Free Objects
The slab allocator has to have a quick and simple means of tracking where free objects are on the partially filled slabs. It achieves this by using an array of unsigned integers called kmem_bufctl_t that is associated with each slab manager as obviously it is up to the slab manager to know where its free objects are.
Figure 8.10: Slab With Descriptor Off-Slab

Figure 8.11: Call Graph: kmem_cache_grow()
Historically, and according to the paper describing the slab allocator [Bon94], kmem_bufctl_t was a linked list of objects. In Linux 2.2.x, this struct was a union of three items, a pointer to the next free object, a pointer to the slab manager and a pointer to the object. Which it was depended on the state of the object.

Today, the slab and cache an object belongs to is determined by the struct page and kmem_bufctl_t is simply an integer array of object indices. The number of elements in the array is the same as the number of objects on the slab.
141 typedef unsigned int kmem_bufctl_t;
As the array is kept after the slab descriptor and there is no pointer to the first element directly, a helper macro slab_bufctl() is provided.

163 #define slab_bufctl(slabp) \
164         ((kmem_bufctl_t *)(((slab_t*)slabp)+1))
This seemingly cryptic macro is quite simple when broken down. The parameter slabp is a pointer to the slab manager. The expression ((slab_t*)slabp)+1 casts slabp to a slab_t struct and adds 1 to it. This will give a pointer to a slab_t which is actually the beginning of the kmem_bufctl_t array. (kmem_bufctl_t *) casts the slab_t pointer to the required type. This results in blocks of code that contain slab_bufctl(slabp)[i]. Translated, that says take a pointer to a slab descriptor, offset it with slab_bufctl() to the beginning of the kmem_bufctl_t array and return the ith element of the array.

The index of the next free object in the slab is stored in slab_t→free, eliminating the need for a linked list to track free objects. When objects are allocated or freed, this index is updated based on information in the kmem_bufctl_t array.
8.2.4 Initialising the kmem_bufctl_t Array
When a cache is grown, all the objects and the kmem_bufctl_t array on the slab are initialised. The array is filled with the index of each object beginning with 1 and ending with the marker BUFCTL_END. For a slab with 5 objects, the elements of the array would look like Figure 8.12.

Figure 8.12: Initialised kmem_bufctl_t Array
The value 0 is stored in slab_t→free as the 0th object is the first free object to be used. The idea is that for a given object n, the index of the next free object will be stored in kmem_bufctl_t[n]. Looking at the array above, the next free object after 0 is 1, after 1 it is 2 and so on. As the array is used, this arrangement will make the array act as a LIFO for free objects.
8.2.5 Finding the Next Free Object
When allocating an object, kmem_cache_alloc() performs the real work of updating the kmem_bufctl_t array by calling kmem_cache_alloc_one_tail(). The field slab_t→free has the index of the first free object. The index of the next free object is at kmem_bufctl_t[slab_t→free]. In code terms, this looks like:

1253         objp = slabp->s_mem + slabp->free*cachep->objsize;
1254         slabp->free=slab_bufctl(slabp)[slabp->free];

The field slabp→s_mem is a pointer to the first object on the slab. slabp→free is the index of the object to allocate and it has to be multiplied by the size of an object.
The index of the next free object is stored at kmem_bufctl_t[slabp→free]. There is no pointer directly to the array hence the helper macro slab_bufctl() is used. Note that the kmem_bufctl_t array is not changed during allocations but that the elements that are unallocated are unreachable. For example, after two allocations, index 0 and 1 of the kmem_bufctl_t array are not pointed to by any other element.
8.2.6 Updating kmem_bufctl_t
The kmem_bufctl_t list is only updated when an object is freed in the function kmem_cache_free_one(). The array is updated with this block of code:

1451         unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
1452
1453         slab_bufctl(slabp)[objnr] = slabp->free;
1454         slabp->free = objnr;

The pointer objp is the object about to be freed and objnr is its index. kmem_bufctl_t[objnr] is updated to point to the current value of slabp→free, effectively placing the object pointed to by free on the pseudo linked list. slabp→free is updated to the object being freed so that it will be the next one allocated.
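Putting the allocation and free snippets together, the following small user-space model shows how the index array behaves as a LIFO of free objects. The object size, object count and the BUFCTL_END value are assumptions made for the sake of a self-contained example; it is a model of the technique, not the kernel code.

#include <stdio.h>

#define NUM_OBJS   5
#define OBJ_SIZE   64
#define BUFCTL_END 0xffffffffU             /* assumed end-of-array marker */

typedef unsigned int kmem_bufctl_t;

static char          s_mem[NUM_OBJS * OBJ_SIZE]; /* stand-in for the slab memory */
static kmem_bufctl_t bufctl[NUM_OBJS];           /* stand-in for slab_bufctl(slabp) */
static unsigned int  free_idx;                   /* models slab_t->free */

static void *model_alloc(void)
{
        void *objp = s_mem + free_idx * OBJ_SIZE;
        free_idx = bufctl[free_idx];             /* follow the index to the next free object */
        return objp;
}

static void model_free(void *objp)
{
        unsigned int objnr = ((char *)objp - s_mem) / OBJ_SIZE;

        bufctl[objnr] = free_idx;                /* push onto the pseudo linked list */
        free_idx = objnr;                        /* freed object is allocated next */
}

int main(void)
{
        unsigned int i;
        void *a, *b;

        /* Initialise the array as described in Section 8.2.4 */
        for (i = 0; i < NUM_OBJS - 1; i++)
                bufctl[i] = i + 1;
        bufctl[NUM_OBJS - 1] = BUFCTL_END;
        free_idx = 0;

        a = model_alloc();                       /* object 0 */
        b = model_alloc();                       /* object 1 */
        model_free(a);                           /* object 0 is next to be handed out again */
        printf("next free index: %u (a=%p, b=%p)\n", free_idx, a, b);
        return 0;
}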
8.2.7 Calculating the Number of Objects on a Slab
During cache creation, the function kmem_cache_estimate() is called to calculate how many objects may be stored on a single slab, taking into account whether the slab descriptor must be stored on-slab or off-slab and the size of each kmem_bufctl_t needed to track if an object is free or not. It returns the number of objects that may be stored and how many bytes are wasted. The number of wasted bytes is important if cache colouring is to be used.

The calculation is quite basic and takes the following steps:

• Initialise wastage to be the total size of the slab, i.e. PAGE_SIZE << gfp_order;

• Subtract the amount of space required to store the slab descriptor;

• Count up the number of objects that may be stored. Include the size of the kmem_bufctl_t if the slab descriptor is stored on the slab. Keep increasing the count until the slab is filled;

• Return the number of objects and bytes wasted.
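A rough user-space sketch of that estimate is shown below. The stand-in slab_t layout, the fixed 4KiB page size and the on/off-slab handling are simplifying assumptions; the real kmem_cache_estimate() also deals with alignment and debugging overheads.

#include <stdio.h>
#include <stddef.h>

typedef unsigned int kmem_bufctl_t;

typedef struct {                     /* stand-in for the on-slab slab_t descriptor */
        void         *list_next, *list_prev;
        unsigned long colouroff;
        void         *s_mem;
        unsigned int  inuse;
        kmem_bufctl_t free;
} slab_t;

static void cache_estimate(unsigned int gfporder, size_t objsize, int on_slab,
                           unsigned int *num, size_t *wastage)
{
        size_t slab_size = (size_t)(1 << gfporder) * 4096;  /* assume 4KiB pages */
        size_t base = 0, extra = 0;
        unsigned int i = 0;

        if (on_slab) {
                base = sizeof(slab_t);           /* descriptor at the start of the slab */
                extra = sizeof(kmem_bufctl_t);   /* one bufctl entry per object */
        }

        /* Keep increasing the object count until the slab is filled */
        while (base + (i + 1) * (objsize + extra) <= slab_size)
                i++;

        *num = i;
        *wastage = slab_size - base - i * (objsize + extra);
}

int main(void)
{
        unsigned int num;
        size_t wastage;

        cache_estimate(0, 256, 1, &num, &wastage);   /* one page, 256 byte objects */
        printf("objects=%u wasted=%zu bytes\n", num, wastage);
        return 0;
}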
8.2.8 Slab Destroying
When a cache is being shrunk or destroyed, the slabs will be deleted. As the objects may have destructors, these must be called, so the tasks of this function are:

• If available, call the destructor for every object in the slab;

• If debugging is enabled, check the red marking and poison pattern;

• Free the pages the slab uses.

The call graph at Figure 8.13 is very simple.

Figure 8.13: Call Graph: kmem_slab_destroy()
8.3 Objects
This section will cover how objects are managed. At this point, most of the really
hard work has been completed by either the cache or slab managers.
8.3.1 Initialising Objects in a Slab
When a slab is created, all the objects in it are put in an initialised state. If a constructor is available, it is called for each object and it is expected that objects are left in an initialised state upon free. Conceptually the initialisation is very simple: cycle through all objects, call the constructor for each and initialise the kmem_bufctl for it. The function kmem_cache_init_objs() is responsible for initialising the objects.
8.3.2 Object Allocation
The function kmem_cache_alloc() is responsible for allocating one object to the caller and behaves slightly differently in the UP and SMP cases. Figure 8.14 shows the basic call graph that is used to allocate an object in the SMP case.

Figure 8.14: Call Graph: kmem_cache_alloc()

There are four basic steps. The first step (kmem_cache_alloc_head()) covers basic checking to make sure the allocation is allowable. The second step is to select which slabs list to allocate from. This will be one of slabs_partial or slabs_free. If there are no slabs in slabs_free, the cache is grown (see Section 8.2.2) to create a new slab in slabs_free. The final step is to allocate the object from the selected slab.

The SMP case takes one further step. Before allocating one object, it will check to see if there is one available from the per-CPU cache and will use it if there is. If there is not, it will allocate batchcount number of objects in bulk and place them in its per-cpu cache. See Section 8.5 for more information on the per-cpu caches.
8.3.3 Object Freeing
kmem_cache_free() is used to free objects and it has a relatively simple task. Just like kmem_cache_alloc(), it behaves differently in the UP and SMP cases. The principal difference between the two cases is that in the UP case, the object is returned directly to the slab but with the SMP case, the object is returned to the per-cpu cache. In both cases, the destructor for the object will be called if one is available. The destructor is responsible for returning the object to the initialised state.

Figure 8.15: Call Graph: kmem_cache_free()
8.4 Sizes Cache
Linux keeps two sets of caches for small memory allocations for which the physical page allocator is unsuitable. One set is for use with DMA and the other is suitable for normal use. The human readable names for these caches are size-N cache and size-N(DMA) cache which are viewable from /proc/slabinfo. Information for each sized cache is stored in a struct cache_sizes, typedeffed to cache_sizes_t, which is defined in mm/slab.c as:

331 typedef struct cache_sizes {
332         size_t           cs_size;
333         kmem_cache_t    *cs_cachep;
334         kmem_cache_t    *cs_dmacachep;
335 } cache_sizes_t;
The fields in this struct are described as follows:
cs_size The size of the memory block;
cs_cachep The cache of blocks for normal memory use;
cs_dmacachep The cache of blocks for use with DMA.
As there are a limited number of these caches that exist, a static array called cache_sizes is initialised at compile time beginning with 32 bytes on a 4KiB machine and 64 for greater page sizes.
337 static cache_sizes_t cache_sizes[] = {
338 #if PAGE_SIZE == 4096
339         {    32,        NULL, NULL},
340 #endif
341         {    64,        NULL, NULL},
342         {   128,        NULL, NULL},
343         {   256,        NULL, NULL},
344         {   512,        NULL, NULL},
345         {  1024,        NULL, NULL},
346         {  2048,        NULL, NULL},
347         {  4096,        NULL, NULL},
348         {  8192,        NULL, NULL},
349         { 16384,        NULL, NULL},
350         { 32768,        NULL, NULL},
351         { 65536,        NULL, NULL},
352         {131072,        NULL, NULL},
353         {     0,        NULL, NULL}
As is obvious, this is a static array that is zero terminated consisting of buffers of succeeding powers of 2 from 2^5 to 2^17. An array now exists that describes each sized cache which must be initialised with caches at system startup.
8.4.1 kmalloc()

With the existence of the sizes cache, the slab allocator is able to offer a new allocator function, kmalloc(), for use when small memory buffers are required. When a request is received, the appropriate sizes cache is selected and an object assigned from it. The call graph in Figure 8.16 is therefore very simple as all the hard work is in cache allocation.
Figure 8.16: Call Graph: kmalloc()

8.4.2 kfree()
Just as there is a kmalloc() function to allocate small memory objects for use, there is a kfree() for freeing them. As with kmalloc(), the real work takes place during object freeing (see Section 8.3.3) so the call graph in Figure 8.17 is very simple.
Figure 8.17: Call Graph: kfree()
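As a brief illustration of how the pair is typically used in 2.4 kernel code, a small buffer might be allocated and released as follows (a minimal sketch; the buffer size, GFP flag choice and function names are arbitrary):

#include <linux/slab.h>
#include <linux/errno.h>

static char *example_buf;

static int example_alloc(void)
{
        /* A 200 byte request is satisfied internally from the size-256 cache */
        example_buf = kmalloc(200, GFP_KERNEL);
        if (!example_buf)
                return -ENOMEM;
        return 0;
}

static void example_free(void)
{
        /* kfree() works out which sizes cache the block came from */
        kfree(example_buf);
}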
8.5 Per-CPU Object Cache
One of the tasks the slab allocator is dedicated to is improved hardware cache utilization. An aim of high performance computing [CS98] in general is to use data on the same CPU for as long as possible. Linux achieves this by trying to keep objects in the same CPU cache with a Per-CPU object cache, simply called a cpucache, for each CPU in the system.
When allocating or freeing objects, they are placed in the cpucache. When there are no objects free, a batch of objects is placed into the pool. When the pool gets too large, half of them are removed and placed in the global cache. This way the hardware cache will be used for as long as possible on the same CPU.

The second major benefit of this method is that spinlocks do not have to be held when accessing the CPU pool as we are guaranteed another CPU won't access the local data. This is important because without the caches, the spinlock would have to be acquired for every allocation and free which is unnecessarily expensive.
8.5.1 Describing the Per-CPU Object Cache
Each cache descriptor has a pointer to an array of cpucaches, described in the cache descriptor as:

231         cpucache_t              *cpudata[NR_CPUS];

This structure is very simple:

173 typedef struct cpucache_s {
174         unsigned int avail;
175         unsigned int limit;
176 } cpucache_t;

The fields are as follows:

avail This is the number of free objects available on this cpucache;

limit This is the total number of free objects that can exist.

A helper macro cc_data() is provided to give the cpucache for a given cache and processor. It is defined as:

180 #define cc_data(cachep) \
181         ((cachep)->cpudata[smp_processor_id()])

This will take a given cache descriptor (cachep) and return a pointer from the cpucache array (cpudata). The index needed is the ID of the current processor, smp_processor_id().

Pointers to objects on the cpucache are placed immediately after the cpucache_t struct. This is very similar to how objects are stored after a slab descriptor.
8.5.2 Adding/Removing Objects from the Per-CPU Cache
To prevent fragmentation, objects are always added or removed from the end of the
array. To add an object (obj) to the CPU cache (cc), the following block of code is
used
cc_entry(cc)[cc->avail++] = obj;
To remove an object
obj = cc_entry(cc)[--cc->avail];
There is a helper macro called cc_entry() which gives a pointer to the first object in the cpucache. It is defined as:

178 #define cc_entry(cpucache) \
179         ((void **)(((cpucache_t*)(cpucache))+1))

This takes a pointer to a cpucache and increments the value by the size of the cpucache_t descriptor, giving the first object in the cache.
8.5.3 Enabling Per-CPU Caches
When a cache is created, its CPU cache has to be enabled and memory allocated for it using kmalloc(). The function enable_cpucache() is responsible for deciding what size to make the cache and calling kmem_tune_cpucache() to allocate memory for it.

Obviously a CPU cache cannot exist until after the various sizes caches have been enabled so a global variable g_cpucache_up is used to prevent CPU caches being enabled prematurely. The function enable_all_cpucaches() cycles through all caches in the cache chain and enables their cpucache.

Once the CPU cache has been set up, it can be accessed without locking as a CPU will never access the wrong cpucache, so access to it is guaranteed to be safe.
8.5.4 Updating Per-CPU Information
When the per-cpu caches have been created or changed, each CPU is signalled via an IPI. It is not sufficient to change all the values in the cache descriptor as that would lead to cache coherency issues and spinlocks would have to be used to protect the CPU caches. Instead a ccupdate_t struct is populated with all the information each CPU needs and each CPU swaps the new data with the old information in the cache descriptor. The struct for storing the new cpucache information is defined as follows:

868 typedef struct ccupdate_struct_s
869 {
870         kmem_cache_t *cachep;
871         cpucache_t *new[NR_CPUS];
872 } ccupdate_struct_t;

cachep is the cache being updated and new is the array of the cpucache descriptors for each CPU on the system. The function smp_function_all_cpus() is used to get each CPU to call the do_ccupdate_local() function which swaps the information from ccupdate_struct_t with the information in the cache descriptor. Once the information has been swapped, the old data can be deleted.
8.5.5 Draining a Per-CPU Cache
When a cache is being shrunk, its first step is to drain the cpucaches of any objects they might have by calling drain_cpu_caches(). This is so that the slab allocator will have a clearer view of what slabs can be freed or not. This is important because if just one object in a slab is placed in a per-cpu cache, that whole slab cannot be freed. If the system is tight on memory, saving a few milliseconds on allocations has a low priority.
8.6 Slab Allocator Initialisation
Here we will describe how the slab allocator initialises itself. When the slab allocator creates a new cache, it allocates the kmem_cache_t from the cache_cache or kmem_cache cache. This is an obvious chicken and egg problem so the cache_cache has to be statically initialised as:

357 static kmem_cache_t cache_cache = {
358         slabs_full:     LIST_HEAD_INIT(cache_cache.slabs_full),
359         slabs_partial:  LIST_HEAD_INIT(cache_cache.slabs_partial),
360         slabs_free:     LIST_HEAD_INIT(cache_cache.slabs_free),
361         objsize:        sizeof(kmem_cache_t),
362         flags:          SLAB_NO_REAP,
363         spinlock:       SPIN_LOCK_UNLOCKED,
364         colour_off:     L1_CACHE_BYTES,
365         name:           "kmem_cache",
366 };

This code statically initialises the kmem_cache_t struct as follows:

358-360 Initialise the three lists as empty lists;

361 The size of each object is the size of a cache descriptor;

362 The creation and deleting of caches is extremely rare so do not consider it for reaping ever;

363 Initialise the spinlock unlocked;

364 Align the objects to the L1 cache;

365 Record the human readable name.

That statically defines all the fields that can be calculated at compile time. To initialise the rest of the struct, kmem_cache_init() is called from start_kernel().
8.7 Interfacing with the Buddy Allocator
The slab allocator does not come with pages attached; it must ask the physical page allocator for its pages. Two APIs are provided for this task called kmem_getpages() and kmem_freepages(). They are basically wrappers around the buddy allocator's API so that slab flags will be taken into account for allocations. For allocations, the default flags are taken from cachep→gfpflags and the order is taken from cachep→gfporder where cachep is the cache requesting the pages. When freeing the pages, PageClearSlab() will be called for every page being freed before calling free_pages().
8.8 What's New in 2.6

The first obvious change is that the version of the /proc/slabinfo format has changed from 1.1 to 2.0 and is a lot friendlier to read. The most helpful change is that the fields now have a header negating the need to memorise what each column means.

The principal algorithms and ideas remain the same and there are no major algorithm shakeups but the implementation is quite different. Particularly, there is a greater emphasis on the use of per-cpu objects and the avoidance of locking. Secondly, there is a lot more debugging code mixed in so keep an eye out for #ifdef DEBUG blocks of code as they can be ignored when reading the code first. Lastly, some changes are purely cosmetic with function name changes but very similar behavior. For example, kmem_cache_estimate() is now called cache_estimate() even though they are identical in every other respect.
Cache descriptor The changes to the kmem_cache_s are minimal. First, the elements are reordered to have commonly used elements, such as the per-cpu related data, at the beginning of the struct (see Section 3.9 for the reasoning). Secondly, the slab lists (e.g. slabs_full) and statistics related to them have been moved to a separate struct kmem_list3. Comments and the unusual use of macros indicate that there is a plan to make the structure per-node.
Cache Static Flags The flags in 2.4 still exist and their usage is the same. CFLGS_OPTIMIZE no longer exists but its usage in 2.4 was non-existent. Two new flags have been introduced which are:

SLAB_STORE_USER This is a debugging only flag for recording the function that freed an object. If the object is used after it was freed, the poison bytes will not match and a kernel error message will be displayed. As the last function to use the object is known, it can simplify debugging.

SLAB_RECLAIM_ACCOUNT This flag is set for caches with objects that are easily reclaimable such as inode caches. A counter is maintained in a variable called slab_reclaim_pages to record how many pages are used in slabs allocated to these caches. This counter is later used in vm_enough_memory() to help determine if the system is truly out of memory.
Cache Reaping This is one of the most interesting changes made to the slab allocator. kmem_cache_reap() no longer exists as it is very indiscriminate in how it shrinks caches when the cache user could have made a far superior selection. Users of caches can now register a shrink cache callback with set_shrinker() for the intelligent aging and shrinking of slabs. This simple function populates a struct shrinker with a pointer to the callback and a seeks weight which indicates how difficult it is to recreate an object before placing it in a linked list called shrinker_list.

During page reclaim, the function shrink_slab() is called which steps through the full shrinker_list and calls each shrinker callback twice. The first call passes 0 as a parameter which indicates that the callback should return how many pages it expects it could free if it was called properly. A basic heuristic is applied to determine if it is worth the cost of using the callback. If it is, it is called a second time with a parameter indicating how many objects to free.

How this mechanism accounts for the number of pages is a little tricky. Each task struct has a field called reclaim_state. When the slab allocator frees pages, this field is updated with the number of pages that are freed. Before calling shrink_slab(), this field is set to 0 and then read again after shrink_cache returns to determine how many pages were freed.
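A rough sketch of how a cache user might hook into this in early 2.6 is shown below. The callback body, the two helper functions and the object counts are hypothetical placeholders for whatever bookkeeping the cache user maintains, and the prototype should be checked against the particular 2.6 kernel in use.

#include <linux/mm.h>

static struct shrinker *my_shrinker;

/* Hypothetical bookkeeping helpers for some private cache of objects. */
static int my_cache_count_freeable(void) { return 0; }
static void my_cache_free_objects(int nr) { (void)nr; }

/* Called by shrink_slab(): first with nr_to_scan == 0 to ask how many objects
 * could be freed, then again with a real count to do the freeing. */
static int my_cache_shrink(int nr_to_scan, unsigned int gfp_mask)
{
        if (nr_to_scan)
                my_cache_free_objects(nr_to_scan);
        return my_cache_count_freeable();
}

static void my_cache_register(void)
{
        /* The seeks weight expresses how costly it is to recreate an object */
        my_shrinker = set_shrinker(DEFAULT_SEEKS, my_cache_shrink);
}

static void my_cache_unregister(void)
{
        remove_shrinker(my_shrinker);
}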
Other changes The rest of the changes are essentially cosmetic. For example, the slab descriptor is now called struct slab instead of slab_t which is consistent with the general trend of moving away from typedefs. Per-cpu caches remain essentially the same except the structs and APIs have new names. The same sort of observation applies to most of the rest of the 2.6 slab allocator implementation.
Chapter 9
High Memory Management
The kernel may only directly address memory for which it has set up a page table
entry. In the most common case, the user/kernel address space split of 3GiB/1GiB
implies that at best only 896MiB of memory may be directly accessed at any given
time on a 32-bit machine as explained in Section 4.1. On 64-bit hardware, this is
not really an issue as there is more than enough virtual address space. It is highly
unlikely there will be machines running 2.4 kernels with terabytes of RAM.
There are many high end 32-bit machines that have more than 1GiB of memory
and the inconveniently located memory cannot be simply ignored.
The solution
Linux uses is to temporarily map pages from high memory into the lower page
tables. This will be discussed in Section 9.2.
High memory and IO have a related problem which must be addressed, as not
all devices are able to address high memory or all the memory available to the CPU.
This may be the case if the CPU has PAE extensions enabled, the device is limited
to addresses the size of a signed 32-bit integer (2GiB) or a 32-bit device is being
used on a 64-bit architecture. Asking the device to write to memory will fail at best
and possibly disrupt the kernel at worst. The solution to this problem is to use a bounce buffer and this will be discussed in Section 9.4.
This chapter begins with a brief description of how the Persistent Kernel Map (PKMap) address space is managed before talking about how pages are mapped and unmapped from high memory. The subsequent section will deal with the case where the mapping must be atomic before discussing bounce buffers in depth. Finally we will talk about how emergency pools are used for when memory is very tight.
9.1 Managing the PKMap Address Space
Space is reserved at the top of the kernel page tables from PKMAP_BASE to FIXADDR_START for a PKMap. The size of the space reserved varies slightly. On the x86, PKMAP_BASE is at 0xFE000000 and the address of FIXADDR_START is a compile time constant that varies with configure options but is typically only a few pages located near the end of the linear address space. This means that there is slightly below 32MiB of page table space for mapping pages from high memory into usable space.
For mapping pages, a single page set of PTEs is stored at the beginning of the PKMap area to allow 1024 high pages to be mapped into low memory for short periods with the function kmap() and unmapped with kunmap(). The pool seems very small but the page is only mapped by kmap() for a very short time. Comments in the code indicate that there was a plan to allocate contiguous page table entries to expand this area but it has remained just that, comments in the code, so a large portion of the PKMap is unused.
The page table entry for use with kmap() is called pkmap_page_table which is located at PKMAP_BASE and set up during system initialisation. On the x86, this takes place at the end of the pagetable_init() function. The pages for the PGD and PMD entries are allocated by the boot memory allocator to ensure they exist.
The current state of the page table entries is managed by a simple array called pkmap_count which has LAST_PKMAP entries in it. On an x86 system without PAE, this is 1024 and with PAE, it is 512. More accurately, albeit not expressed in code, the LAST_PKMAP variable is equivalent to PTRS_PER_PTE.

Each element is not exactly a reference count but it is very close. If the entry is 0, the page is free and has not been used since the last TLB flush. If it is 1, the slot is unused but a page is still mapped there waiting for a TLB flush. Flushes are delayed until every slot has been used at least once as a global flush is required for all CPUs when the global page tables are modified and is extremely expensive. Any higher value is a reference count of n-1 users of the page.
9.2 Mapping High Memory Pages
The API for mapping pages from high memory is described in Table 9.1. The main function for mapping a page is kmap(). For users that do not wish to block, kmap_nonblock() is available and interrupt users have kmap_atomic(). The kmap pool is quite small so it is important that users of kmap() call kunmap() as quickly as possible because the pressure on this small window grows incrementally worse as the size of high memory grows in comparison to low memory.
The kmap() function itself is fairly simple. It first checks to make sure an interrupt is not calling this function (as it may sleep) and calls out_of_line_bug() if true. An interrupt handler calling BUG() would panic the system so out_of_line_bug() prints out bug information and exits cleanly. The second check is that the page is below highmem_start_page as pages below this mark are already visible and do not need to be mapped.

It then checks if the page is already in low memory and simply returns the address if it is. This way, users that need kmap() may use it unconditionally knowing that if it is already a low memory page, the function is still safe. If it is a high page to be mapped, kmap_high() is called to begin the real work.

The kmap_high() function begins with checking the page→virtual field which is set if the page is already mapped. If it is NULL, map_new_virtual() provides a mapping for the page.

Figure 9.1: Call Graph: kmap()
Creating a new virtual mapping with map_new_virtual() is a simple case of linearly scanning pkmap_count. The scan starts at last_pkmap_nr instead of 0 to prevent searching over the same areas repeatedly between kmap()s. When last_pkmap_nr wraps around to 0, flush_all_zero_pkmaps() is called to set all entries from 1 to 0 before flushing the TLB.

If, after another scan, an entry is still not found, the process sleeps on the pkmap_map_wait wait queue until it is woken up after the next kunmap().

Once a mapping has been created, the corresponding entry in the pkmap_count array is incremented and the virtual address in low memory returned.
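A stripped-down model of that scan is sketched below. It is user-space pseudo-code built on the description above; LAST_PKMAP, the initial count of 2 for a fresh mapping and the flush handling are simplified assumptions rather than the kernel implementation.

#define LAST_PKMAP 1024                   /* 1024 slots on x86 without PAE */

static int pkmap_count[LAST_PKMAP];       /* 0 free, 1 awaiting TLB flush, n = n-1 users */
static unsigned int last_pkmap_nr;

/* Stand-in for flush_all_zero_pkmaps(): release slots left at 1 and flush. */
static void flush_all_zero_pkmaps_model(void)
{
        unsigned int i;

        for (i = 0; i < LAST_PKMAP; i++)
                if (pkmap_count[i] == 1)
                        pkmap_count[i] = 0;
        /* a global TLB flush would be issued here */
}

/* Returns a free slot index or -1, in which case the real kernel sleeps on
 * pkmap_map_wait and retries after the next kunmap(). */
static int map_new_virtual_model(void)
{
        unsigned int tries;

        for (tries = 0; tries < LAST_PKMAP; tries++) {
                last_pkmap_nr = (last_pkmap_nr + 1) % LAST_PKMAP;
                if (last_pkmap_nr == 0)
                        flush_all_zero_pkmaps_model();
                if (pkmap_count[last_pkmap_nr] == 0) {
                        /* one user plus one so the slot is not reused before
                         * the delayed TLB flush */
                        pkmap_count[last_pkmap_nr] = 2;
                        return (int)last_pkmap_nr;
                }
        }
        return -1;
}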
9.2.1 Unmapping Pages
The API for unmapping pages from high memory is described in Table 9.2. The kunmap() function, like its complement, performs two checks. The first is an identical check to kmap() for usage from interrupt context. The second is that the page is below highmem_start_page. If it is, the page already exists in low memory and needs no further handling. Once established that it is a page to be unmapped, kunmap_high() is called to perform the unmapping.
void * kmap(struct page *page)
  Takes a struct page from high memory and maps it into low memory. The
  address returned is the virtual address of the mapping

void * kmap_nonblock(struct page *page)
  This is the same as kmap() except it will not block if no slots are
  available and will instead return NULL. This is not the same as
  kmap_atomic() which uses specially reserved slots

void * kmap_atomic(struct page *page, enum km_type type)
  There are slots maintained in the map for atomic use by interrupts (see
  Section 9.3). Their use is heavily discouraged and callers of this function
  may not sleep or schedule. This function will map a page from high memory
  atomically for a specific purpose

Table 9.1: High Memory Mapping API
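As a small sketch of how this API is typically used by 2.4 code that must touch a page which may live in ZONE_HIGHMEM (the function name and copy length are illustrative):

#include <linux/highmem.h>
#include <linux/string.h>

/* Copy a buffer into a page that may be in high memory.  kmap() returns a
 * usable kernel virtual address whether or not the page is highmem. */
static void copy_to_page(struct page *page, const char *src, size_t len)
{
        char *vaddr = kmap(page);

        memcpy(vaddr, src, len);
        kunmap(page);           /* release the PKMap slot as soon as possible */
}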
kunmap_high() is simple in principle. It decrements the corresponding element for this page in pkmap_count. If it reaches 1 (remember this means no more users but a TLB flush is required), any process waiting on the pkmap_map_wait queue is woken up as a slot is now available. The page is not unmapped from the page tables then as that would require a TLB flush. It is delayed until flush_all_zero_pkmaps() is called.
void kunmap(struct page *page)
Unmaps a struct page from low memory and frees up the page table entry
mapping it
void kunmap_atomic(void *kvaddr, enum km_type type)
Unmap a page that was mapped atomically
Table 9.2: High Memory Unmapping API
9.3 Mapping High Memory Pages Atomically
The use of kmap_atomic() is discouraged but slots are reserved for each CPU for when they are necessary, such as when bounce buffers are used by devices from interrupt. There are a varying number of different requirements an architecture has for atomic high memory mapping which are enumerated by km_type. The total number of uses is KM_TYPE_NR. On the x86, there are a total of six different uses for atomic kmaps.
Figure 9.2: Call Graph: kunmap()
There are KM_TYPE_NR entries per processor reserved at boot time for atomic mapping at the location FIX_KMAP_BEGIN and ending at FIX_KMAP_END. Obviously a user of an atomic kmap may not sleep or exit before calling kunmap_atomic() as the next process on the processor may try to use the same entry and fail.

The function kmap_atomic() has the very simple task of mapping the requested page to the slot set aside in the page tables for the requested type of operation and processor. The function kunmap_atomic() is interesting as it will only clear the PTE with pte_clear() if debugging is enabled. It is considered unnecessary to bother unmapping atomic pages as the next call to kmap_atomic() will simply replace it, making TLB flushes unnecessary.
9.4 Bounce Buffers

Bounce buffers are required for devices that cannot access the full range of memory available to the CPU. An obvious example of this is when a device does not address with as many bits as the CPU, such as 32-bit devices on 64-bit architectures or recent Intel processors with PAE enabled.

The basic concept is very simple. A bounce buffer resides in memory low enough for a device to copy from and write data to. It is then copied to the desired user page in high memory. This additional copy is undesirable, but unavoidable. Pages are allocated in low memory which are used as buffer pages for DMA to and from the device. This is then copied by the kernel to the buffer page in high memory when IO completes so the bounce buffer acts as a type of bridge. There is significant overhead to this operation as at the very least it involves copying a full page but it is insignificant in comparison to swapping out pages in low memory.
9.4.1 Disk Buffering

Blocks, typically around 1KiB, are packed into pages and managed by a struct buffer_head allocated by the slab allocator. Users of buffer heads have the option of registering a callback function. This function is stored in buffer_head→b_end_io() and called when IO completes. It is this mechanism that bounce buffers use to have data copied out of the bounce buffers. The callback registered is the function bounce_end_io_write().

Any other feature of buffer heads or how they are used by the block layer is beyond the scope of this document and more the concern of the IO layer.
9.4.2 Creating Bounce Buffers

The creation of a bounce buffer is a simple affair which is started by the create_bounce() function. The principle is very simple: create a new buffer using a provided buffer head as a template. The function takes two parameters which are a read/write parameter (rw) and the template buffer head to use (bh_orig).

Figure 9.3: Call Graph: create_bounce()

A page is allocated for the buffer itself with the function alloc_bounce_page() which is a wrapper around alloc_page() with one important addition. If the allocation is unsuccessful, there is an emergency pool of pages and buffer heads available for bounce buffers. This is discussed further in Section 9.5.
The buffer head is, predictably enough, allocated with alloc_bounce_bh() which, similar in principle to alloc_bounce_page(), calls the slab allocator for a buffer_head and uses the emergency pool if one cannot be allocated. Additionally, bdflush is woken up to start flushing dirty buffers out to disk so that buffers are more likely to be freed soon.

Once the page and buffer_head have been allocated, information is copied from the template buffer_head into the new one. Since part of this operation may use kmap_atomic(), bounce buffers are only created with the IRQ safe io_request_lock held. The IO completion callbacks are changed to be either bounce_end_io_write() or bounce_end_io_read() depending on whether this is a read or write buffer so the data will be copied to and from high memory.
The most important aspect of the allocations to note is that the GFP flags specify that no IO operations involving high memory may be used. This is specified with SLAB_NOHIGHIO to the slab allocator and GFP_NOHIGHIO to the buddy allocator. This is important as bounce buffers are used for IO operations with high memory. If the allocator tries to perform high memory IO, it will recurse and eventually crash.
9.4.3 Copying via bounce buffers

Figure 9.4: Call Graph: bounce_end_io_read/write()
Data is copied via the bounce buffer differently depending on whether it is a read or write buffer. If the buffer is for writes to the device, the buffer is populated with the data from high memory during bounce buffer creation with the function copy_from_high_bh(). The callback function bounce_end_io_write() will complete the IO later when the device is ready for the data.

If the buffer is for reading from the device, no data transfer may take place until the device is ready. When it is, the interrupt handler for the device calls the callback function bounce_end_io_read() which copies the data to high memory with copy_to_high_bh_irq().

In either case the buffer head and page may be reclaimed by bounce_end_io() once the IO has completed and the IO completion function for the template buffer_head is called. If the emergency pools are not full, the resources are added to the pools, otherwise they are freed back to the respective allocators.
9.5 Emergency Pools
Two emergency pools of buffer_heads and pages are maintained for the express use of bounce buffers. If memory is too tight for allocations, failing to complete IO requests is going to compound the situation as buffers from high memory cannot be freed until low memory is available. This leads to processes halting, thus preventing the possibility of them freeing up their own memory.

The pools are initialised by init_emergency_pool() to contain POOL_SIZE entries each, which is currently defined as 32. The pages are linked via the page→list field on a list headed by emergency_pages. Figure 9.5 illustrates how pages are stored on emergency pools and acquired when necessary.

The buffer_heads are very similar as they are linked via the buffer_head→inode_buffers field on a list headed by emergency_bhs. The number of entries left on the pages and buffer lists are recorded by two counters, nr_emergency_pages and nr_emergency_bhs respectively, and the two lists are protected by the emergency_lock spinlock.
Figure 9.5: Acquiring Pages from Emergency Pools
9.6 What's New in 2.6
Memory Pools In 2.4, the high memory manager was the only subsystem that maintained emergency pools of pages. In 2.6, memory pools are implemented as a generic concept when a minimum amount of stuff needs to be reserved for when memory is tight. Stuff in this case can be any type of object such as pages in the case of the high memory manager or, more frequently, some object managed by the slab allocator. Pools are initialised with mempool_create() which takes a number of arguments. They are the minimum number of objects that should be reserved (min_nr), an allocator function for the object type (alloc_fn()), a free function (free_fn()) and optional private data that is passed to the allocate and free functions.
The memory pool API provides two generic allocate and free functions called mempool_alloc_slab() and mempool_free_slab(). When the generic functions are used, the private data is the slab cache that objects are to be allocated and freed from.

In the case of the high memory manager, two pools of pages are created. One page pool is for normal use and the second page pool is for use with ISA devices that must allocate from ZONE_DMA. The allocate function is page_pool_alloc() and the private data parameter passed indicates the GFP flags to use. The free function is page_pool_free(). The memory pools replace the emergency pool code that exists in 2.4.
To allocate or free objects from the memory pool, the memory pool API functions mempool_alloc() and mempool_free() are provided. Memory pools are destroyed with mempool_destroy().
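To make the API concrete, the following is a minimal sketch of how a 2.6 subsystem might create and use a memory pool backed by a slab cache with the generic helpers. The object type and cache name (io_request) are hypothetical and exist only for illustration.

    #include <linux/mempool.h>
    #include <linux/slab.h>

    /* Hypothetical object type used only for this example */
    struct io_request {
            int dummy;
    };

    static kmem_cache_t *req_cachep;
    static mempool_t *req_pool;

    static int req_pool_init(void)
    {
            req_cachep = kmem_cache_create("io_request",
                                           sizeof(struct io_request),
                                           0, 0, NULL, NULL);
            if (!req_cachep)
                    return -ENOMEM;

            /* Reserve a minimum of 32 objects; the slab cache is the
             * private data passed to the generic allocate/free helpers */
            req_pool = mempool_create(32, mempool_alloc_slab,
                                      mempool_free_slab, req_cachep);
            if (!req_pool) {
                    kmem_cache_destroy(req_cachep);
                    return -ENOMEM;
            }
            return 0;
    }

    static void req_pool_exit(void)
    {
            mempool_destroy(req_pool);
            kmem_cache_destroy(req_cachep);
    }

    /* Allocation falls back to the reserved objects if the slab
     * allocator cannot satisfy the request */
    static struct io_request *get_request(void)
    {
            return mempool_alloc(req_pool, GFP_NOIO);
    }

    static void put_request(struct io_request *req)
    {
            mempool_free(req, req_pool);
    }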
Mapping High Memory Pages
In 2.4, the field page→virtual was used to store the address of the page within the pkmap_count array. Due to the number of struct pages that exist in a high memory system, this is a very large penalty to pay for the relatively small number of pages that need to be mapped into ZONE_NORMAL. 2.6 still has this pkmap_count array but it is managed very differently.

In 2.6, a hash table called page_address_htable is created. This table is hashed based on the address of the struct page and the list is used to locate struct page_address_slot. This struct has two fields of interest, a struct page and a virtual address. When the kernel needs to find the virtual address used by a mapped page, it is located by traversing through this hash bucket. How the page is actually mapped into lower memory is essentially the same as 2.4 except now page→virtual is no longer required.
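The shape of the lookup can be illustrated with a small user-space model. The names below mirror the description (a table of hash buckets holding page/virtual pairs), but the bucket count, hash function and helper name page_address_lookup() are arbitrary choices for the sketch, not the kernel's own definitions.

    #include <stddef.h>
    #include <stdint.h>

    struct page;                            /* opaque for this model */

    #define PA_HASH_SIZE 128                /* arbitrary for the sketch */

    /* One entry per currently mapped high memory page: which page and
     * where it is mapped in the kernel virtual address space */
    struct page_address_slot {
            struct page *page;
            void *virtual;
            struct page_address_slot *next;
    };

    static struct page_address_slot *page_address_htable[PA_HASH_SIZE];

    static unsigned int hash_page(struct page *page)
    {
            return ((uintptr_t)page >> 4) % PA_HASH_SIZE;
    }

    /* Walk the bucket for this struct page and return its virtual
     * address, or NULL if the page is not currently mapped */
    void *page_address_lookup(struct page *page)
    {
            struct page_address_slot *pas;

            for (pas = page_address_htable[hash_page(page)]; pas; pas = pas->next)
                    if (pas->page == page)
                            return pas->virtual;
            return NULL;
    }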
Performing IO

The last major change is that the struct bio is now used instead of the struct buffer_head when performing IO. How bio structures work is beyond the scope of this book. However, the principal reason that bio structures were introduced is so that IO could be performed in blocks of whatever size the underlying device supports. In 2.4, all IO had to be broken up into page sized chunks regardless of the transfer rate of the underlying device.
Chapter 10
Page Frame Reclamation
A running system will eventually use all available page frames for purposes like disk buffers, dentries, inode entries, process pages and so on. Linux needs to select old pages which can be freed and invalidated for new uses before physical memory is exhausted. This chapter will focus exclusively on how Linux implements its page replacement policy and how different types of pages are invalidated.

The methods Linux uses to select pages are rather empirical in nature and the theory behind the approach is based on multiple different ideas. It has been shown to work well in practice and adjustments are made based on user feedback and benchmarks. The basics of the page replacement policy are the first item of discussion in this chapter.
The second topic of discussion is the page cache. All data that is read from disk is stored in the page cache to reduce the amount of disk IO that must be performed. Strictly speaking, this is not directly related to page frame reclamation, but the LRU lists and page cache are closely related. The relevant section will focus on how pages are added to the page cache and quickly located.
This will bring us to the third topic, the LRU lists. With the exception of the slab allocator, all pages in use by the system are stored on LRU lists and linked together via page→lru so they can be easily scanned for replacement. The slab pages are not stored on the LRU lists as it is considerably more difficult to age a page based on the objects used by the slab. The section will focus on how pages move through the LRU lists before they are reclaimed.
From there, we'll cover how pages belonging to other caches, such as the dcache, and the slab allocator are reclaimed before talking about how process-mapped pages are removed. Process mapped pages are not easily swappable as there is no way to map struct pages to PTEs except to search every page table, which is far too expensive. If the page cache has a large number of process-mapped pages in it, process page tables will be walked and pages swapped out by swap_out() until enough pages have been freed, but this will still have trouble with shared pages. If a page is shared, a swap entry is allocated, the PTE filled with the necessary information to find the page in swap again and the reference count decremented. Only when the count reaches zero will the page be freed. Pages like this are considered to be in the swap cache.
Finally, this chapter will cover the page replacement daemon kswapd, how it is implemented and what its responsibilities are.
10.1 Page Replacement Policy
During discussions the page replacement policy is frequently said to be a Least Recently Used (LRU)-based algorithm, but this is not strictly speaking true as the lists are not strictly maintained in LRU order. The LRU in Linux consists of two lists called the active_list and inactive_list. The objective is for the active_list to contain the working set [Den70] of all processes and the inactive_list to contain reclaim candidates. As all reclaimable pages are contained in just two lists and pages belonging to any process may be reclaimed, rather than just those belonging to a faulting process, the replacement policy is a global one.
The lists resemble a simplified LRU 2Q [JS94] where two lists called Am and A1 are maintained. With LRU 2Q, pages when first allocated are placed on a FIFO queue called A1. If they are referenced while on that queue, they are placed in a normal LRU managed list called Am. This is roughly analogous to using lru_cache_add() to place pages on a queue called inactive_list (A1) and using mark_page_accessed() to get moved to the active_list (Am). The algorithm describes how the size of the two lists have to be tuned but Linux takes a simpler approach by using refill_inactive() to move pages from the bottom of active_list to inactive_list to keep active_list about two thirds the size of the total page cache. Figure 10.1 illustrates how the two lists are structured, how pages are added and how pages move between the lists with refill_inactive().
The lists described for 2Q presume Am is an LRU list, but the list in Linux more closely resembles a Clock algorithm [Car84] where the hand-spread is the size of the active list. When pages reach the bottom of the list, the referenced flag is checked. If it is set, it is moved back to the top of the list and the next page checked. If it is cleared, it is moved to the inactive_list.

The Move-To-Front heuristic means that the lists behave in an LRU-like manner, but there are too many differences between the Linux replacement policy and LRU to consider it a stack algorithm [MM87]. Even if we ignore the problem of analysing multi-programmed systems [CD80] and the fact that the memory size for each process is not fixed, the policy does not satisfy the inclusion property as the location of pages in the lists depends heavily upon the size of the lists as opposed to the time of last reference. Neither is the list priority ordered as that would require list updates with every reference. As a final nail in the stack algorithm coffin, the lists are almost ignored when paging out from processes as pageout decisions are related to their location in the virtual address space of the process rather than the location within the page lists.
Figure 10.1: Page Cache LRU Lists

In summary, the algorithm does exhibit LRU-like behaviour and it has been shown by benchmarks to perform well in practice. There are only two cases where the algorithm is likely to behave really badly. The first is if the candidates for reclamation are principally anonymous pages. In this case, Linux will keep examining a large number of pages before linearly scanning process page tables searching for pages to reclaim, but this situation is fortunately rare.
The second situation is where there is a single process with many file backed resident pages in the inactive_list that are being written to frequently. Processes and kswapd may go into a loop of constantly laundering these pages and placing them at the top of the inactive_list without freeing anything. In this case, few pages are moved from the active_list to the inactive_list as the ratio between the two list sizes does not change significantly.
10.2 Page Cache
The page cache is a set of data structures which contain pages that are backed by regular files, block devices or swap. There are basically four types of pages that exist in the cache:

• Pages that were faulted in as a result of reading a memory mapped file;

• Blocks read from a block device or filesystem are packed into special pages called buffer pages. The number of blocks that may fit depends on the size of the block and the page size of the architecture;

• Anonymous pages exist in a special aspect of the page cache called the swap cache when slots are allocated in the backing storage for page-out, discussed further in Chapter 11;

• Pages belonging to shared memory regions are treated in a similar fashion to anonymous pages. The only difference is that shared pages are added to the swap cache and space reserved in backing storage immediately after the first write to the page.

The principal reason for the existence of this cache is to eliminate unnecessary disk reads. Pages read from disk are stored in a page hash table, which is hashed on the struct address_space and the offset, and which is always searched before the disk is accessed. An API is provided that is responsible for manipulating the page cache which is listed in Table 10.1.
10.2.1 Page Cache Hash Table
There is a requirement that pages in the page cache be quickly located. To facilitate this, pages are inserted into a table page_hash_table and the fields page→next_hash and page→pprev_hash are used to handle collisions. The table is declared as follows in mm/filemap.c:
45 atomic_t page_cache_size = ATOMIC_INIT(0);
46 unsigned int page_hash_bits;
47 struct page **page_hash_table;
The table is allocated during system initialisation by page_cache_init() which takes the number of physical pages in the system as a parameter. The desired size of the table (htable_size) is enough to hold pointers to every struct page in the system and is calculated by

    htable_size = num_physpages * sizeof(struct page *)

To allocate a table, the system begins with an order allocation large enough to contain the entire table. It calculates this value by starting at 0 and incrementing it until 2^order > htable_size. This may be roughly expressed as the integer component of the following simple equation:

    order = log2((htable_size * 2) - 1)

An attempt is made to allocate this order of pages with __get_free_pages(). If the allocation fails, lower orders will be tried and if no allocation is satisfied, the system panics.

The value of page_hash_bits is based on the size of the table for use with the hashing function _page_hashfn(). The value is calculated by successive divides by two but, in real terms, this is equivalent to:

    page_hash_bits = log2((PAGE_SIZE * 2^order) / sizeof(struct page *))

This makes the table a power-of-two hash table which negates the need to use a modulus which is a common choice for hashing functions.
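The sizing arithmetic is easy to check with a few lines of user-space C. This is only a model of the calculation described above, using illustrative values for num_physpages and PAGE_SIZE and reading "order" as a page allocation order, so 2^order pages of PAGE_SIZE bytes must be large enough to hold the table.

    #include <stdio.h>

    int main(void)
    {
            unsigned long num_physpages = 65536;    /* e.g. 256MiB of 4KiB pages */
            unsigned long page_size = 4096;
            unsigned long ptr_size = sizeof(void *);

            /* Enough room for a pointer to every struct page */
            unsigned long htable_size = num_physpages * ptr_size;

            /* Smallest order such that 2^order pages cover htable_size */
            unsigned int order = 0;
            while ((page_size << order) < htable_size)
                    order++;

            /* Number of hash bits: log2 of the number of pointer slots
             * that fit in the allocated pages */
            unsigned long slots = (page_size << order) / ptr_size;
            unsigned int page_hash_bits = 0;
            while ((1UL << (page_hash_bits + 1)) <= slots)
                    page_hash_bits++;

            printf("htable_size=%lu order=%u page_hash_bits=%u\n",
                   htable_size, order, page_hash_bits);
            return 0;
    }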
void add_to_page_cache(struct page *page, struct address_space *mapping, unsigned long offset)
    Adds a page to the LRU with lru_cache_add() in addition to adding it to the inode queue and page hash tables

void add_to_page_cache_unique(struct page *page, struct address_space *mapping, unsigned long offset, struct page **hash)
    Similar to add_to_page_cache() except it checks that the page is not already in the page cache. This is required when the caller does not hold the pagecache_lock spinlock

void remove_inode_page(struct page *page)
    This function removes a page from the inode and hash queues with remove_page_from_inode_queue() and remove_page_from_hash_queue(), effectively removing the page from the page cache

struct page * page_cache_alloc(struct address_space *x)
    This is a wrapper around alloc_pages() which uses x→gfp_mask as the GFP mask

void page_cache_get(struct page *page)
    Increases the reference count to a page already in the page cache

int page_cache_read(struct file *file, unsigned long offset)
    This function adds a page corresponding to an offset with a file if it is not already there. If necessary, the page will be read from disk using an address_space_operations→readpage function

void page_cache_release(struct page *page)
    An alias for __free_page(). The reference count is decremented and if it drops to 0, the page will be freed

Table 10.1: Page Cache API
10.2.2 Inode Queue
The inode queue is part of the struct address_space introduced in Section 4.4.2. The struct contains three lists: clean_pages is a list of clean pages associated with the inode; dirty_pages are those which have been written to since the last sync to disk; and locked_pages are those currently locked. These three lists in combination are considered to be the inode queue for a given mapping and the page→list field is used to link pages on it. Pages are added to the inode queue with add_page_to_inode_queue(), which places pages on the clean_pages list, and removed with remove_page_from_inode_queue().
10.2.3 Adding Pages to the Page Cache
Pages read from a file or block device are generally added to the page cache to avoid further disk IO. Most filesystems use the high level function generic_file_read() as their file_operations→read(). The shared memory filesystem, which is covered in Chapter 12, is one noteworthy exception but, in general, filesystems perform their operations through the page cache. For the purposes of this section, we'll illustrate how generic_file_read() operates and how it adds pages to the page cache.

For normal IO, generic_file_read() begins with a few basic checks before calling do_generic_file_read(). (Direct IO is handled differently with generic_file_direct_IO().) This searches the page cache, by calling __find_page_nolock() with the pagecache_lock held, to see if the page already exists in it. If it does not, a new page is allocated with page_cache_alloc(), which is a simple wrapper around alloc_pages(), and added to the page cache with __add_to_page_cache(). Once a page frame is present in the page cache, generic_file_readahead() is called which uses page_cache_read() to read the page from disk. It reads the page using mapping→a_ops→readpage(), where mapping is the address_space managing the file. readpage() is the filesystem specific function used to read a page on disk.

Figure 10.2: Call Graph: generic_file_read()

Anonymous pages are added to the swap cache when they are unmapped from a process, which will be discussed further in Section 11.4. Until an attempt is made to swap them out, they have no address_space acting as a mapping or any offset within a file, leaving nothing to hash them into the page cache with. Note that these pages still exist on the LRU lists however. Once in the swap cache, the only real difference between anonymous pages and file backed pages is that anonymous pages will use swapper_space as their struct address_space.
Shared memory pages are added during one of two cases. The first is during shmem_getpage_locked() which is called when a page has to be either fetched from swap or allocated as it is the first reference. The second is when the swapout code calls shmem_unuse(). This occurs when a swap area is being deactivated and a page, backed by swap space, is found that does not appear to belong to any process. The inodes related to shared memory are exhaustively searched until the correct page is found. In both cases, the page is added with add_to_page_cache().

Figure 10.3: Call Graph: add_to_page_cache()
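The overall shape of the lookup-or-read path described in this section can be summarised in a sketch. This is simplified pseudocode, not the actual mm/filemap.c code: the _sketch helpers stand in for __find_page_nolock() and __add_to_page_cache() (which also require the pagecache_lock), and readahead and error handling are omitted.

    /* Sketch of the page cache path taken by do_generic_file_read() */
    struct page *page_cache_lookup_or_read(struct address_space *mapping,
                                           struct file *file,
                                           unsigned long offset)
    {
            struct page *page;

            /* 1. Search the page hash table (done under pagecache_lock) */
            page = __find_page_nolock_sketch(mapping, offset);
            if (page)
                    return page;                    /* cache hit, no IO */

            /* 2. Miss: allocate a frame and insert it into the cache */
            page = page_cache_alloc(mapping);
            if (!page)
                    return NULL;
            __add_to_page_cache_sketch(page, mapping, offset);

            /* 3. Have the filesystem read the data from disk */
            mapping->a_ops->readpage(file, page);
            return page;
    }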
10.3 LRU Lists
As stated in Section 10.1, the LRU lists consist of two lists called active_list and inactive_list. They are declared in mm/page_alloc.c and are protected by the pagemap_lru_lock spinlock. They, broadly speaking, store the hot and cold pages respectively, or in other words, the active_list contains all the working sets in the system and the inactive_list contains reclaim candidates. The API which deals with the LRU lists is listed in Table 10.2.
10.3.1 Refilling inactive_list

When caches are being shrunk, pages are moved from the active_list to the inactive_list by the function refill_inactive(). It takes as a parameter the number of pages to move, which is calculated in shrink_caches() as a ratio depending on nr_pages, the number of pages in active_list and the number of pages in inactive_list. The number of pages to move is calculated as

    pages = nr_pages * nr_active_pages / (2 * (nr_inactive_pages + 1))
void lru_cache_add(struct page *page)
    Add a cold page to the inactive_list. Will be moved to active_list with a call to mark_page_accessed() if the page is known to be hot, such as when a page is faulted in.

void lru_cache_del(struct page *page)
    Removes a page from the LRU lists by calling either del_page_from_active_list() or del_page_from_inactive_list(), whichever is appropriate.

void mark_page_accessed(struct page *page)
    Mark that the page has been accessed. If it was not recently referenced (in the inactive_list and PG_referenced flag not set), the referenced flag is set. If it is referenced a second time, activate_page() is called, which marks the page hot, and the referenced flag is cleared.

void activate_page(struct page *page)
    Removes a page from the inactive_list and places it on active_list. It is very rarely called directly as the caller has to know the page is on inactive_list. mark_page_accessed() should be used instead.

Table 10.2: LRU List API
This keeps the active_list about two thirds the size of the inactive_list and the number of pages to move is determined as a ratio based on how many pages we desire to swap out (nr_pages).

Pages are taken from the end of the active_list. If the PG_referenced flag is set, it is cleared and the page is put back at the top of the active_list as it has been recently used and is still hot. This is sometimes referred to as rotating the list. If the flag is cleared, it is moved to the inactive_list and the PG_referenced flag set so that it will be quickly promoted to the active_list if necessary.
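This rotation amounts to a short loop. The following is a kernel-style sketch of refill_inactive() as just described, not a copy of the real mm/vmscan.c code; the flag and list helpers are the standard 2.4 ones and the pagemap_lru_lock locking is omitted.

    /* Sketch: move up to nr_pages pages from the tail of the active_list
     * to the inactive_list, rotating recently referenced pages instead */
    static void refill_inactive_sketch(int nr_pages)
    {
            while (nr_pages-- && !list_empty(&active_list)) {
                    struct page *page;

                    page = list_entry(active_list.prev, struct page, lru);
                    if (PageTestandClearReferenced(page)) {
                            /* Still hot: rotate to the head of active_list */
                            list_del(&page->lru);
                            list_add(&page->lru, &active_list);
                            continue;
                    }
                    /* Cold: demote it, flagging it for quick promotion
                     * should it be referenced again soon */
                    del_page_from_active_list(page);
                    add_page_to_inactive_list(page);
                    SetPageReferenced(page);
            }
    }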
10.3.2 Reclaiming Pages from the LRU Lists
The function shrink_cache() is the part of the replacement algorithm which takes pages from the inactive_list and decides how they should be swapped out. The two starting parameters which determine how much work will be performed are nr_pages and priority. nr_pages starts out as SWAP_CLUSTER_MAX, currently defined as 32 in mm/vmscan.c. The variable priority starts as DEF_PRIORITY, currently defined as 6 in mm/vmscan.c.

Two parameters, max_scan and max_mapped, determine how much work the function will do and are affected by the priority. Each time the function shrink_caches() is called without enough pages being freed, the priority will be decreased until the highest priority 1 is reached.
The variable max_scan is the maximum number of pages that will be scanned by this function and is simply calculated as

    max_scan = nr_inactive_pages / priority

where nr_inactive_pages is the number of pages in the inactive_list. This means that at lowest priority 6, at most one sixth of the pages in the inactive_list will be scanned and at highest priority, all of them will be.

The second parameter is max_mapped which determines how many process pages are allowed to exist in the page cache before whole processes will be swapped out. This is calculated as the minimum of either one tenth of max_scan or

    max_mapped = nr_pages * 2^(10 - priority)

In other words, at lowest priority, the maximum number of mapped pages allowed is either one tenth of max_scan or 16 times the number of pages to swap out (nr_pages), whichever is the lower number. At high priority, it is either one tenth of max_scan or 512 times the number of pages to swap out.
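The two limits are easy to compute directly. The short user-space program below reproduces the arithmetic just described for each priority level; the value chosen for nr_inactive_pages is purely illustrative.

    #include <stdio.h>

    #define DEF_PRIORITY     6
    #define SWAP_CLUSTER_MAX 32

    int main(void)
    {
            unsigned long nr_inactive_pages = 6000;   /* illustrative value */
            int nr_pages = SWAP_CLUSTER_MAX;
            int priority;

            for (priority = DEF_PRIORITY; priority >= 1; priority--) {
                    unsigned long max_scan = nr_inactive_pages / priority;
                    unsigned long by_scan = max_scan / 10;
                    unsigned long by_pages = nr_pages * (1UL << (10 - priority));
                    unsigned long max_mapped = by_scan < by_pages ? by_scan
                                                                  : by_pages;

                    printf("priority %d: max_scan=%lu max_mapped=%lu\n",
                           priority, max_scan, max_mapped);
            }
            return 0;
    }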
From there, the function is basically a very large for-loop which scans at most max_scan pages to free up nr_pages pages from the end of the inactive_list or until the inactive_list is empty. After each page, it checks to see whether it should reschedule itself so that the swapper does not monopolise the CPU.
For each type of page found on the list, it makes a different decision on what to do. The different page types and actions taken are handled in this order:

Page is mapped by a process. This jumps to the page_mapped label which we will meet again in a later case. The max_mapped count is decremented. If it reaches 0, the page tables of processes will be linearly searched and swapped out by the function swap_out().

Page is locked and the PG_launder bit is set. The page is locked for IO so could be skipped over. However, if the PG_launder bit is set, it means that this is the second time the page has been found locked, so it is better to wait until the IO completes and get rid of it. A reference to the page is taken with page_cache_get() so that the page will not be freed prematurely and wait_on_page() is called which sleeps until the IO is complete. Once it is completed, the reference count is decremented with page_cache_release(). When the count reaches zero, the page will be reclaimed.
Page is dirty, is unmapped by all processes, has no buffers and belongs to a device or file mapping. As the page belongs to a file or device mapping, it has a valid writepage() function available via page→mapping→a_ops→writepage. The PG_dirty bit is cleared and the PG_launder bit is set as it is about to start IO. A reference is taken for the page with page_cache_get() before calling the writepage() function to synchronise the page with the backing file before dropping the reference with page_cache_release(). Be aware that this case will also synchronise anonymous pages that are part of the swap cache with the backing storage as swap cache pages use swapper_space as a page→mapping. The page remains on the LRU. When it is found again, it will be simply freed if the IO has completed and the page will be reclaimed. If the IO has not completed, the kernel will wait for the IO to complete as described in the previous case.

Page has buffers associated with data on disk. A reference is taken to the page and an attempt is made to free the pages with try_to_release_page(). If it succeeds and is an anonymous page (no page→mapping), the page is removed from the LRU and page_cache_release() called to decrement the usage count. There is only one case where an anonymous page has associated buffers and that is when it is backed by a swap file as the page needs to be written out in block-sized chunks. If, on the other hand, it is backed by a file or device, the reference is simply dropped and the page will be freed as usual when the count reaches 0.
Page is anonymous and is mapped by more than one process. The LRU is unlocked and the page is unlocked before dropping into the same page_mapped label that was encountered in the first case. In other words, the max_mapped count is decremented and swap_out() called when, or if, it reaches 0.

Page has no process referencing it. This is the final case that is fallen into rather than explicitly checked for. If the page is in the swap cache, it is removed from it as the page is now synchronised with the backing storage and has no process referencing it. If it was part of a file, it is removed from the inode queue, deleted from the page cache and freed.
10.4 Shrinking all caches
The function responsible for shrinking the various caches is shrink_caches() which takes a few simple steps to free up some memory. The maximum number of pages that will be written to disk in any given pass is nr_pages which is initialised by try_to_free_pages_zone() to be SWAP_CLUSTER_MAX. The limitation is there so that if kswapd schedules a large number of pages to be written to disk, it will sleep occasionally to allow the IO to take place. As pages are freed, nr_pages is decremented to keep count.

The amount of work that will be performed also depends on the priority initialised by try_to_free_pages_zone() to be DEF_PRIORITY. For each pass that does not free up enough pages, the priority is decremented until the highest priority of 1 is reached.

The function first calls kmem_cache_reap() (see Section 8.1.7) which selects a slab cache to shrink. If nr_pages number of pages are freed, the work is complete and the function returns, otherwise it will try to free nr_pages from other caches.

If other caches are to be affected, refill_inactive() will move pages from the active_list to the inactive_list before shrinking the page cache by reclaiming pages at the end of the inactive_list with shrink_cache().

Finally, it shrinks three special caches, the dcache (shrink_dcache_memory()), the icache (shrink_icache_memory()) and the dqcache (shrink_dqcache_memory()). These objects are quite small in themselves but a cascading effect allows a lot more pages to be freed in the form of buffer and disk caches.
Figure 10.4: Call Graph: shrink_caches()
10.5 Swapping Out Process Pages
When max_mapped pages have been found in the page cache, swap_out() is called to start swapping out process pages. Starting from the mm_struct pointed to by swap_mm and the address mm→swap_address, the page tables are searched forward until nr_pages have been freed.

Figure 10.5: Call Graph: swap_out()

All process mapped pages are examined regardless of where they are in the lists or when they were last referenced, but pages which are part of the active_list or have been recently referenced will be skipped over. The examination of hot pages is a bit costly but insignificant in comparison to linearly searching all processes for the PTEs that reference a particular struct page.
Once it has been decided to swap out pages from a process, an attempt will be made to swap out at least SWAP_CLUSTER_MAX number of pages and the full list of mm_structs will only be examined once to avoid constant looping when no pages are available. Writing out the pages in bulk increases the chance that pages close together in the process address space will be written out to adjacent slots on disk.

The marker swap_mm is initialised to point to init_mm and the swap_address is initialised to 0 the first time it is used. A task has been fully searched when the swap_address is equal to TASK_SIZE. Once a task has been selected to swap pages from, the reference count to the mm_struct is incremented so that it will not be freed early and swap_out_mm() is called with the selected mm_struct as a parameter. This function walks each VMA the process holds and calls swap_out_vma() for it. This is to avoid having to walk the entire page table which will be largely sparse. swap_out_pgd() and swap_out_pmd() walk the page tables for the given VMA until finally try_to_swap_out() is called on the actual page and PTE.

The function try_to_swap_out() first checks to make sure that the page is not part of the active_list, has been recently referenced or belongs to a zone that we are not interested in. Once it has been established this is a page to be swapped out, it is removed from the process page tables. The newly removed PTE is then checked to see if it is dirty. If it is, the struct page flags will be updated to match so that it will get synchronised with the backing storage. If the page is already a part of the swap cache, the RSS is simply updated and the reference to the page is dropped, otherwise the page is added to the swap cache. How pages are added to the swap cache and synchronised with backing storage is discussed in Chapter 11.
10.6 Pageout Daemon (kswapd)
During system startup, a kernel thread called kswapd is started from kswapd_init() which continuously executes the function kswapd() in mm/vmscan.c which usually sleeps. This daemon is responsible for reclaiming pages when memory is running low. Historically, kswapd used to wake up every 10 seconds but now it is only woken by the physical page allocator when the pages_low number of free pages in a zone is reached (see Section 2.2.1).

It is this daemon that performs most of the tasks needed to maintain the page cache correctly, shrink slab caches and swap out processes if necessary. Unlike swapout daemons such as Solaris [MM01], which are woken up with increasing frequency as there is memory pressure, kswapd keeps freeing pages until the pages_high watermark is reached. Under extreme memory pressure, processes will do the work of kswapd synchronously by calling balance_classzone() which calls try_to_free_pages_zone(). As shown in Figure 10.6, it is at try_to_free_pages_zone() where the physical page allocator synchronously performs the same task as kswapd when the zone is under heavy pressure.

Figure 10.6: Call Graph: kswapd()

When kswapd is woken up, it performs the following steps (a sketch of the main loop follows the list):

• Calls kswapd_can_sleep() which cycles through all zones checking the need_balance field in the struct zone_t. If any of them are set, it can not sleep;

• If it cannot sleep, it is removed from the kswapd_wait wait queue;

• Calls kswapd_balance(), which cycles through all zones. It will free pages in a zone with try_to_free_pages_zone() if need_balance is set and will keep freeing until the pages_high watermark is reached;

• The task queue for tq_disk is run so that pages queued will be written out;

• Add kswapd back to the kswapd_wait queue and go back to the first step.
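As promised above, here is a sketch of that loop in kernel-style pseudocode. It follows the steps listed rather than the exact mm/vmscan.c code; in particular, the real function manipulates the wait queue and task state explicitly rather than using interruptible_sleep_on().

    /* Sketch of the main kswapd() loop described in the steps above */
    static int kswapd_sketch(void *unused)
    {
            for (;;) {
                    /* Sleep on kswapd_wait until the page allocator wakes us,
                     * but only if no zone has need_balance set */
                    if (kswapd_can_sleep())
                            interruptible_sleep_on(&kswapd_wait);

                    /* Free pages in every zone with need_balance set until
                     * the pages_high watermark is reached */
                    kswapd_balance();

                    /* Push queued disk IO so laundered pages get written */
                    run_task_queue(&tq_disk);
            }
            return 0;
    }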
10.7 What's New in 2.6
kswapd

As stated in Section 2.6, there is now a kswapd for every memory node in the system. These daemons are still started from kswapd() and they all execute the same code except their work is confined to their local node. The main changes to the implementation of kswapd are related to the kswapd-per-node change.

The basic operation of kswapd remains the same. Once woken, it calls balance_pgdat() for the pgdat it is responsible for. balance_pgdat() has two modes of operation. When called with nr_pages == 0, it will continually try to free pages from each zone in the local pgdat until pages_high is reached. When nr_pages is specified, it will try and free either nr_pages or MAX_CLUSTER_MAX * 8, whichever is the smaller number of pages.
Balancing Zones

The two main functions called by balance_pgdat() to free pages are shrink_slab() and shrink_zone(). shrink_slab() was covered in Section 8.8 so will not be repeated here. The function shrink_zone() is called to free a number of pages based on how urgent it is to free pages. This function behaves very similarly to how 2.4 works. refill_inactive_zone() will move a number of pages from zone→active_list to zone→inactive_list. Remember, as covered in Section 2.6, that LRU lists are now per-zone and not global as they are in 2.4. shrink_cache() is called to remove pages from the LRU and reclaim pages.
Pageout Pressure

In 2.4, the pageout priority determined how many pages would be scanned. In 2.6, there is a decaying average that is updated by zone_adj_pressure(). This adjusts the zone→pressure field to indicate how many pages should be scanned for replacement. When more pages are required, this will be pushed up towards the highest value of DEF_PRIORITY << 10 and then decays over time. The value of this average affects how many pages will be scanned in a zone for replacement. The objective is to have page replacement start working and slow gracefully rather than act in a bursty nature.
Manipulating LRU Lists

In 2.4, a spinlock would be acquired when removing pages from the LRU list. This made the lock very heavily contended so, to relieve contention, operations involving the LRU lists take place via struct pagevec structures. This allows pages to be added or removed from the LRU lists in batches of up to PAGEVEC_SIZE numbers of pages.

To illustrate, when refill_inactive_zone() and shrink_cache() are removing pages, they acquire the zone→lru_lock lock, remove large blocks of pages and store them on a temporary list. Once the list of pages to remove is assembled, shrink_list() is called to perform the actual freeing of pages, which can now perform most of its task without needing the zone→lru_lock spinlock.

When adding the pages back, a new page vector struct is initialised with pagevec_init(). Pages are added to the vector with pagevec_add() and then committed to being placed on the LRU list in bulk with pagevec_release(). There is a sizable API associated with pagevec structs which can be seen in <linux/pagevec.h> with most of the implementation in mm/swap.c.
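A hedged sketch of how this batching API is typically used in 2.6 follows; it releases references on a batch of pages via a pagevec. The function name and the source of the pages array are placeholders for illustration.

    #include <linux/pagevec.h>
    #include <linux/mm.h>

    /* Sketch: drop references on a batch of pages via a pagevec so the
     * expensive LRU work is done PAGEVEC_SIZE pages at a time */
    static void release_pages_in_batches(struct page **pages, int nr)
    {
            struct pagevec pvec;
            int i;

            pagevec_init(&pvec, 0);         /* 0: pages are not cache-cold */

            for (i = 0; i < nr; i++) {
                    /* pagevec_add() returns the space left; when the vector
                     * fills up, drain it with pagevec_release() */
                    if (!pagevec_add(&pvec, pages[i]))
                            pagevec_release(&pvec);
            }
            /* Drain whatever is left over */
            pagevec_release(&pvec);
    }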
Chapter 11
Swap Management
Just as Linux uses free memory for purposes such as buffering data from disk, there eventually is a need to free up private or anonymous pages used by a process. These pages, unlike those backed by a file on disk, cannot be simply discarded to be read in later. Instead they have to be carefully copied to backing storage, sometimes called the swap area. This chapter details how Linux uses and manages its backing storage.

Strictly speaking, Linux does not swap as swapping refers to copying an entire process address space to disk and paging to copying out individual pages. Linux actually implements paging as modern hardware supports it, but traditionally has called it swapping in discussions and documentation. To be consistent with the Linux usage of the word, we too will refer to it as swapping.
There are two principal reasons that the existence of swap space is desirable. First, it expands the amount of memory a process may use. Virtual memory and swap space allows a large process to run even if the process is only partially resident. As old pages may be swapped out, the amount of memory addressed may easily exceed RAM as demand paging will ensure the pages are reloaded if necessary.

The casual reader (not to mention the affluent reader) may think that with a sufficient amount of memory, swap is unnecessary, but this brings us to the second reason. A significant number of the pages referenced by a process early in its life may only be used for initialisation and then never used again. It is better to swap out those pages and create more disk buffers than leave them resident and unused.

It is important to note that swap is not without its drawbacks and the most important one is the most obvious one: disk is slow, very very slow. If processes are frequently addressing a large amount of memory, no amount of swap or expensive high-performance disks will make it run within a reasonable time, only more RAM will help. This is why it is very important that the correct page be swapped out as discussed in Chapter 10, but also that related pages be stored close together in the swap space so they are likely to be swapped in at the same time while reading ahead. We will start with how Linux describes a swap area.

This chapter begins with describing the structures Linux maintains about each active swap area in the system and how the swap area information is organised on disk. We then cover how Linux remembers how to find pages in the swap after they have been paged out and how swap slots are allocated. After that, the swap cache is discussed, which is important for shared pages. At that point, there is enough information to begin understanding how swap areas are activated and deactivated, how pages are paged in and paged out and finally how the swap area is read and written to.
11.1 Describing the Swap Area
Each active swap area, be it a file or partition, has a struct swap_info_struct describing the area. All the structs in the running system are stored in a statically declared array called swap_info which holds MAX_SWAPFILES entries, where MAX_SWAPFILES is statically defined as 32. This means that at most 32 swap areas can exist on a running system. The swap_info_struct is declared as follows in <linux/swap.h>:
 64 struct swap_info_struct {
 65     unsigned int flags;
 66     kdev_t swap_device;
 67     spinlock_t sdev_lock;
 68     struct dentry * swap_file;
 69     struct vfsmount *swap_vfsmnt;
 70     unsigned short * swap_map;
 71     unsigned int lowest_bit;
 72     unsigned int highest_bit;
 73     unsigned int cluster_next;
 74     unsigned int cluster_nr;
 75     int prio;
 76     int pages;
 77     unsigned long max;
 78     int next;
 79 };
Here is a small description of each of the fields in this quite sizable struct.

flags This is a bit field with two possible values. SWP_USED is set if the swap area is currently active. SWP_WRITEOK is defined as 3, the two lowest significant bits, including the SWP_USED bit. The flags field is set to SWP_WRITEOK when Linux is ready to write to the area as it must be active to be written to;

swap_device The device corresponding to the partition used for this swap area is stored here. If the swap area is a file, this is NULL;

sdev_lock As with many structs in Linux, this one has to be protected too. sdev_lock is a spinlock protecting the struct, principally the swap_map. It is locked and unlocked with swap_device_lock() and swap_device_unlock();

swap_file This is the dentry for the actual special file that is mounted as a swap area. This could be the dentry for a file in the /dev/ directory for example in the case a partition is mounted. This field is needed to identify the correct swap_info_struct when deactivating a swap area;

vfs_mount This is the vfs_mount object corresponding to where the device or file for this swap area is stored;

swap_map This is a large array with one entry for every swap entry, or page sized slot in the area. An entry is a reference count of the number of users of this page slot. The swap cache counts as one user and every PTE that has been paged out to the slot counts as a user. If it is equal to SWAP_MAP_MAX, the slot is allocated permanently. If equal to SWAP_MAP_BAD, the slot will never be used;

lowest_bit This is the lowest possible free slot available in the swap area and is used to start from when linearly scanning to reduce the search space. It is known that there are definitely no free slots below this mark;

highest_bit This is the highest possible free slot available in this swap area. Similar to lowest_bit, there are definitely no free slots above this mark;

cluster_next This is the offset of the next cluster of blocks to use. The swap area tries to have pages allocated in cluster blocks to increase the chance related pages will be stored together;

cluster_nr This is the number of pages left to allocate in this cluster;

prio Each swap area has a priority which is stored in this field. Areas are arranged in order of priority and determine how likely the area is to be used. By default the priorities are arranged in order of activation but the system administrator may also specify it using the -p flag when using swapon;

pages As some slots on the swap file may be unusable, this field stores the number of usable pages in the swap area. This differs from max in that slots marked SWAP_MAP_BAD are not counted;

max This is the total number of slots in this swap area;

next This is the index in the swap_info array of the next swap area in the system.

The areas, though stored in an array, are also kept in a pseudo list called swap_list which is a very simple type declared as follows in <linux/swap.h>:
153 struct swap_list_t {
154     int head;    /* head of priority-ordered swapfile list */
155     int next;    /* swapfile to be used next */
156 };
The field swap_list_t→head is the swap area of the highest priority swap area in use and swap_list_t→next is the next swap area that should be used. This is so areas may be arranged in order of priority when searching for a suitable area but still looked up quickly in the array when necessary.

Each swap area is divided up into a number of page sized slots on disk which means that each slot is 4096 bytes on the x86 for example. The first slot is always reserved as it contains information about the swap area that should not be overwritten. The first 1 KiB of the swap area is used to store a disk label for the partition that can be picked up by userspace tools.

The remaining space is used for information about the swap area which is filled when the swap area is created with the system program mkswap. The information is used to fill in a union swap_header which is declared as follows in <linux/swap.h>:
25 union swap_header {
26     struct
27     {
28         char reserved[PAGE_SIZE - 10];
29         char magic[10];
30     } magic;
31     struct
32     {
33         char         bootbits[1024];
34         unsigned int version;
35         unsigned int last_page;
36         unsigned int nr_badpages;
37         unsigned int padding[125];
38         unsigned int badpages[1];
39     } info;
40 };
A description of each of the fields follows:

magic The magic part of the union is used just for identifying the magic string. The string exists to make sure there is no chance a partition that is not a swap area will be used and to decide what version of swap area it is. If the string is SWAP-SPACE, it is version 1 of the swap file format. If it is SWAPSPACE2, it is version 2. The large reserved array is just so that the magic string will be read from the end of the page;

bootbits This is the reserved area containing information about the partition such as the disk label;

version This is the version of the swap area layout;

last_page This is the last usable page in the area;
nr_badpages The known number of bad pages that exist in the swap area are stored in this field;

padding A disk sector is usually about 512 bytes in size. The three fields version, last_page and nr_badpages make up 12 bytes and the padding fills up the remaining 500 bytes to cover one sector;

badpages The remainder of the page is used to store the indices of up to MAX_SWAP_BADPAGES number of bad page slots. These slots are filled in by the mkswap system program if the -c switch is specified to check the area.

MAX_SWAP_BADPAGES is a compile time constant which varies if the struct changes but it is 637 entries in its current form, as given by the simple equation

    MAX_SWAP_BADPAGES = (PAGE_SIZE - 1024 - 512 - 10) / sizeof(long)

where 1024 is the size of the bootblock, 512 is the size of the padding and 10 is the size of the magic string identifying the format of the swap file.
11.2 Mapping Page Table Entries to Swap Entries
When a page is swapped out, Linux uses the corresponding PTE to store enough information to locate the page on disk again. Obviously a PTE is not large enough in itself to store precisely where on disk the page is located, but it is more than enough to store an index into the swap_info array and an offset within the swap_map, and this is precisely what Linux does.

Each PTE, regardless of architecture, is large enough to store a swp_entry_t which is declared as follows in <linux/shmem_fs.h>:
16 typedef struct {
17     unsigned long val;
18 } swp_entry_t;
Two macros are provided for the translation of PTEs to swap entries and vice versa. They are pte_to_swp_entry() and swp_entry_to_pte() respectively.

Each architecture has to be able to determine if a PTE is present or swapped out. For illustration, we will show how this is implemented on the x86. In the swp_entry_t, two bits are always kept free. On the x86, Bit 0 is reserved for the _PAGE_PRESENT flag and Bit 7 is reserved for _PAGE_PROTNONE. The requirement for both bits is explained in Section 3.2. Bits 1-6 are for the type, which is the index within the swap_info array, and are returned by the SWP_TYPE() macro.

Bits 8-31 are used to store the offset within the swap_map from the swp_entry_t. On the x86, this means 24 bits are available, limiting the size of the swap area to 64GiB. The macro SWP_OFFSET() is used to extract the offset.

Figure 11.1: Storing Swap Entry Information in swp_entry_t

To encode a type and offset into a swp_entry_t, the macro SWP_ENTRY() is available which simply performs the relevant bit shifting operations. The relationship between all these macros is illustrated in Figure 11.1.
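The bit manipulation is straightforward and can be modelled directly from the layout just described (bits 1-6 for the type, bits 8-31 for the offset on the x86). The following is a user-space model of the macros, not the kernel's own definitions.

    #include <assert.h>
    #include <stdio.h>

    typedef struct { unsigned long val; } swp_entry_t;

    /* x86 layout described above: bit 0 and bit 7 are kept free for
     * _PAGE_PRESENT and _PAGE_PROTNONE, type in bits 1-6, offset in 8-31 */
    static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
    {
            swp_entry_t e;
            e.val = ((type & 0x3f) << 1) | (offset << 8);
            return e;
    }

    static unsigned long swp_type(swp_entry_t e)
    {
            return (e.val >> 1) & 0x3f;
    }

    static unsigned long swp_offset(swp_entry_t e)
    {
            return e.val >> 8;
    }

    int main(void)
    {
            swp_entry_t e = swp_entry(3, 123456);

            assert(swp_type(e) == 3);
            assert(swp_offset(e) == 123456);
            printf("type=%lu offset=%lu raw=0x%lx\n",
                   swp_type(e), swp_offset(e), e.val);
            return 0;
    }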
It should be noted that the six bits for type should allow up to 64 swap areas to exist in a 32 bit architecture instead of the MAX_SWAPFILES restriction of 32. The restriction is due to the consumption of the vmalloc address space. If a swap area is the maximum possible size, then 32MiB is required for the swap_map (2^24 * sizeof(short)); remember that each page uses one short for the reference count. For just MAX_SWAPFILES maximum number of swap areas to exist, 1GiB of virtual malloc space is required, which is simply impossible because of the user/kernel linear address space split.

This would imply supporting 64 swap areas is not worth the additional complexity but there are cases where a large number of swap areas would be desirable even if the overall swap available does not increase. Some modern machines (a Sun E450 could have in the region of 20 disks in it, for example) have many separate disks which between them can create a large number of separate block devices. In this case, it is desirable to create a large number of small swap areas which are evenly distributed across all disks. This would allow a high degree of parallelism in the page swapping behaviour which is important for swap intensive applications.
11.3 Allocating a swap slot
All page sized slots are tracked by the array swap_info_struct→swap_map which is of type unsigned short. Each entry is a reference count of the number of users of the slot, which happens in the case of a shared page, and is 0 when free. If the entry is SWAP_MAP_MAX, the page is permanently reserved for that slot. It is unlikely, if not impossible, for this condition to occur but it exists to ensure the reference count does not overflow. If the entry is SWAP_MAP_BAD, the slot is unusable.

Figure 11.2: Call Graph: get_swap_page()
The task of finding and allocating a swap entry is divided into two major tasks. The first is performed by the high level function get_swap_page(). Starting with swap_list→next, it searches swap areas for a suitable slot. Once a slot has been found, it records what the next swap area to be used will be and returns the allocated entry.

The task of searching the map is the responsibility of scan_swap_map(). In principle, it is very simple as it linearly scans the array for a free slot and returns. Predictably, the implementation is a bit more thorough.

Linux attempts to organise pages into clusters on disk of size SWAPFILE_CLUSTER. It allocates SWAPFILE_CLUSTER number of pages sequentially in swap, keeping count of the number of sequentially allocated pages in swap_info_struct→cluster_nr and recording the current offset in swap_info_struct→cluster_next. Once a sequential block has been allocated, it searches for a block of free entries of size SWAPFILE_CLUSTER. If a block large enough can be found, it will be used as another cluster sized sequence.

If no free clusters large enough can be found in the swap area, a simple first-free search starting from swap_info_struct→lowest_bit is performed. The aim is to have pages swapped out at the same time close together on the premise that pages swapped out together are related. This premise, which seems strange at first glance, is quite solid when it is considered that the page replacement algorithm will use swap space most when linearly scanning the process address space swapping out pages. Without scanning for large free blocks and using them, it is likely that the scanning would degenerate to first-free searches and never improve. With it, processes exiting are likely to free up large blocks of slots.
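The allocation policy can be captured in a simplified model. The sketch below, written as user-space C over a plain array standing in for the swap_map, follows the strategy described: continue the current cluster, then look for a fresh run of SWAPFILE_CLUSTER free slots, and only then fall back to a first-free scan from lowest_bit. Locking, highest_bit maintenance and the other details of the real scan_swap_map() are omitted; the caller is assumed to increment swap_map[] for the slot that is returned.

    #define SWAPFILE_CLUSTER 256
    #define NO_SLOT          (-1L)

    struct swap_area_model {
            unsigned short *swap_map;       /* 0 means the slot is free */
            long            max;            /* number of slots */
            long            lowest_bit;     /* no free slots below this */
            long            cluster_next;   /* next slot in current cluster */
            long            cluster_nr;     /* slots left in current cluster */
    };

    static long scan_swap_map_model(struct swap_area_model *si)
    {
            long i, run;

            /* 1. Keep allocating sequentially while a cluster is open */
            if (si->cluster_nr > 0 && si->cluster_next < si->max &&
                si->swap_map[si->cluster_next] == 0) {
                    si->cluster_nr--;
                    return si->cluster_next++;
            }

            /* 2. Look for a fresh run of SWAPFILE_CLUSTER free slots */
            for (i = si->lowest_bit, run = 0; i < si->max; i++) {
                    run = (si->swap_map[i] == 0) ? run + 1 : 0;
                    if (run == SWAPFILE_CLUSTER) {
                            si->cluster_next = i - SWAPFILE_CLUSTER + 2;
                            si->cluster_nr = SWAPFILE_CLUSTER - 1;
                            return i - SWAPFILE_CLUSTER + 1;
                    }
            }

            /* 3. No cluster available: first-free scan from lowest_bit */
            for (i = si->lowest_bit; i < si->max; i++)
                    if (si->swap_map[i] == 0)
                            return i;

            return NO_SLOT;                 /* the area is full */
    }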
11.4 Swap Cache
Pages that are shared between many processes can not be easily swapped out because, as mentioned, there is no quick way to map a struct page to every PTE that references it. This leads to the race condition where a page is present for one PTE and swapped out for another gets updated without being synced to disk, thereby losing the update.

To address this problem, shared pages that have a reserved slot in backing storage are considered to be part of the swap cache. The swap cache is purely conceptual as it is simply a specialisation of the page cache. The first principal difference between pages in the swap cache rather than the page cache is that pages in the swap cache always use swapper_space as their address_space in page→mapping. The second difference is that pages are added to the swap cache with add_to_swap_cache() instead of add_to_page_cache().
Figure 11.3: Call Graph: add_to_swap_cache()

Anonymous pages are not part of the swap cache until an attempt is made to swap them out. The variable swapper_space is declared as follows in swap_state.c:
39 struct address_space swapper_space = {
40     LIST_HEAD_INIT(swapper_space.clean_pages),
41     LIST_HEAD_INIT(swapper_space.dirty_pages),
42     LIST_HEAD_INIT(swapper_space.locked_pages),
43     0,
44     &swap_aops,
45 };
A page is identified as being part of the swap cache once the page→mapping field has been set to swapper_space, which is tested by the PageSwapCache() macro. Linux uses the exact same code for keeping pages between swap and memory in sync as it uses for keeping file-backed pages and memory in sync as they both share the page cache code; the differences are just in the functions used.

The address space for backing storage, swapper_space, uses swap_aops for its address_space→a_ops. The page→index field is then used to store the swp_entry_t structure instead of a file offset, which is its normal purpose. The address_space_operations struct swap_aops is declared as follows in swap_state.c:
34 static struct address_space_operations swap_aops = {
35     writepage: swap_writepage,
36     sync_page: block_sync_page,
37 };
When a page is being added to the swap cache, a slot is allocated with get_swap_page(), added to the page cache with add_to_swap_cache() and then marked dirty. When the page is next laundered, it will actually be written to backing storage on disk as the normal page cache would operate. This process is illustrated in Figure 11.4.

Figure 11.4: Adding a Page to the Swap Cache

Subsequent swapping of the page from shared PTEs results in a call to swap_duplicate() which simply increments the reference to the slot in the swap_map. If the PTE is marked dirty by the hardware as a result of a write, the bit is cleared and the struct page is marked dirty with set_page_dirty() so that the on-disk copy will be synced before the page is dropped. This ensures that until all references to the page have been dropped, a check will be made to ensure the data on disk matches the data in the page frame.

When the reference count to the page finally reaches 0, the page is eligible to be dropped from the page cache and the swap map count will have the count of the number of PTEs the on-disk slot belongs to so that the slot will not be freed prematurely. It is laundered and finally dropped with the same LRU aging and logic described in Chapter 10.
If, on the other hand, a page fault occurs for a page that is swapped out, the logic in do_swap_page() will check to see if the page exists in the swap cache by calling lookup_swap_cache(). If it does, the PTE is updated to point to the page frame, the page reference count incremented and the swap slot decremented with swap_free().
swp_entry_t get_swap_page()
    This function allocates a slot in a swap_map by searching active swap areas. This is covered in greater detail in Section 11.3 but included here as it is principally used in conjunction with the swap cache

int add_to_swap_cache(struct page *page, swp_entry_t entry)
    This function adds a page to the swap cache. It first checks if it already exists by calling swap_duplicate() and, if not, adds it to the swap cache via the normal page cache interface function add_to_page_cache_unique()

struct page * lookup_swap_cache(swp_entry_t entry)
    This searches the swap cache and returns the struct page corresponding to the supplied entry. It works by searching the normal page cache based on swapper_space and the swap_map offset

int swap_duplicate(swp_entry_t entry)
    This function verifies a swap entry is valid and, if so, increments its swap map count

void swap_free(swp_entry_t entry)
    The complement function to swap_duplicate(). It decrements the relevant counter in the swap_map. When the count reaches zero, the slot is effectively free

Table 11.1: Swap Cache API
11.5 Reading Pages from Backing Storage
The principal function used when reading in pages is read_swap_cache_async() which is mainly called during page faulting. The function begins by searching the swap cache with find_get_page(). Normally, swap cache searches are performed by lookup_swap_cache() but that function updates statistics on the number of searches performed and, as the cache may need to be searched multiple times, find_get_page() is used instead.

The page can already exist in the swap cache if another process has the same page mapped or multiple processes are faulting on the same page at the same time. If the page does not exist in the swap cache, one must be allocated and filled with data from backing storage.
Figure 11.5: Call Graph: read_swap_cache_async()

Once the page is allocated with alloc_page(), it is added to the swap cache with add_to_swap_cache() as swap cache operations may only be performed on pages in the swap cache. If the page cannot be added to the swap cache, the swap cache will be searched again to make sure another process has not put the data in the swap cache already.

To read information from backing storage, rw_swap_page() is called, which is discussed in Section 11.7. Once the function completes, page_cache_release() is called to drop the reference to the page taken by find_get_page().
11.6 Writing Pages to Backing Storage
When any page is being written to disk, the address_space→a_ops is consulted to find the appropriate write-out function. In the case of backing storage, the address_space is swapper_space and the swap operations are contained in swap_aops. The struct swap_aops registers swap_writepage() as its write-out function.

The function swap_writepage() behaves differently depending on whether the writing process is the last user of the swap cache page or not. It knows this by calling remove_exclusive_swap_page() which checks if there are any other processes using the page. This is a simple case of examining the page count with the pagecache_lock held. If no other process is mapping the page, it is removed from the swap cache and freed.

If remove_exclusive_swap_page() removed the page from the swap cache and freed it, swap_writepage() will unlock the page as it is no longer in use. If it still exists in the swap cache, rw_swap_page() is called to write the data to the backing storage.
Figure 11.6: Call Graph: sys_writepage()
11.7 Reading/Writing Swap Area Blocks
The top-level function for reading and writing to the swap area is rw_swap_page(). This function ensures that all operations are performed through the swap cache to prevent lost updates. rw_swap_page_base() is the core function which performs the real work.

It begins by checking if the operation is a read. If it is, it clears the uptodate flag with ClearPageUptodate() as the page is obviously not up to date if IO is required to fill it with data. This flag will be set again if the page is successfully read from disk. It then calls get_swaphandle_info() to acquire the device for the swap partition or the inode for the swap file. These are required by the block layer which will be performing the actual IO.

The core function can work with either swap partitions or files as it uses the block layer function brw_page() to perform the actual disk IO. If the swap area is a file, bmap() is used to fill a local array with a list of all blocks in the filesystem which contain the page data. Remember that filesystems may have their own method of storing files on disk and it is not as simple as the swap partition where information may be written directly to disk. If the backing storage is a partition, then only one page-sized block requires IO and, as there is no filesystem involved, bmap() is unnecessary.

Once it is known what blocks must be read or written, a normal block IO operation takes place with brw_page(). All IO that is performed is asynchronous so the function returns quickly. Once the IO is complete, the block layer will unlock the page and any waiting process will wake up.
11.8 Activating a Swap Area
As it has now been covered what swap areas are, how they are represented and how pages are tracked, it is time to see how they all tie together to activate an area. Activating an area is conceptually quite simple: open the file, load the header information from disk, populate a swap_info_struct and add it to the swap list. The function responsible for the activation of a swap area is sys_swapon() and it takes two parameters, the path to the special file for the swap area and a set of flags. While swap is being activated, the Big Kernel Lock (BKL) is held, which prevents any application entering kernel space while this operation is being performed. The function is quite large but can be broken down into the following simple steps:
• Find a free swap_info_struct in the swap_info array and initialise it with
  default values

• Call user_path_walk() which traverses the directory tree for the supplied
  specialfile and populates a nameidata structure with the available data on
  the file, such as the dentry and the filesystem information for where it is
  stored (vfsmount)

• Populate swap_info_struct fields pertaining to the dimensions of the swap
  area and how to find it. If the swap area is a partition, the block size will
  be configured to the PAGE_SIZE before calculating the size. If it is a file, the
  information is obtained directly from the inode

• Ensure the area is not already activated. If not, allocate a page from memory
  and read the first page-sized slot from the swap area. This page contains
  information such as the number of good slots and how to populate the
  swap_info_struct→swap_map with the bad entries

• Allocate memory with vmalloc() for swap_info_struct→swap_map and
  initialise each entry with 0 for good slots and SWAP_MAP_BAD otherwise. Ideally
  the header information will be a version 2 file format as version 1 was limited
  to swap areas of just under 128MiB for architectures with 4KiB page sizes like
  the x86³

• After ensuring the information indicated in the header matches the actual
  swap area, fill in the remaining information in the swap_info_struct such
  as the maximum number of pages and the available good pages. Update the
  global statistics for nr_swap_pages and total_swap_pages

• The swap area is now fully active and initialised and so it is inserted into the
  swap list in the correct position based on the priority of the newly activated
  area
At the end of the function, the BKL is released and the system now has a new
swap area available for paging to.
3 See the Code Commentary for the comprehensive reason for this.
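From user space, activation is simply the swapon(2) system call that sys_swapon() implements. The snippet below is a hedged illustration of the calling convention only; the path shown is a made-up example and the area is assumed to have been prepared beforehand with mkswap. The priority is encoded in the flags argument.

#include <stdio.h>
#include <sys/swap.h>   /* swapon(), SWAP_FLAG_PREFER, SWAP_FLAG_PRIO_* */

int main(void)
{
        /* Hypothetical swap partition, already initialised with mkswap */
        const char *area = "/dev/sdb2";
        int prio = 5;
        int flags = SWAP_FLAG_PREFER |
                    ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

        /* Requires CAP_SYS_ADMIN; in 2.4 the kernel holds the BKL while
         * it performs the activation steps listed above */
        if (swapon(area, flags) < 0) {
                perror("swapon");
                return 1;
        }
        printf("%s activated with priority %d\n", area, prio);
        return 0;
}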
11.9 Deactivating a Swap Area
In comparison to activating a swap area, deactivation is incredibly expensive. The
principal problem is that the area cannot be simply removed; every page that is
swapped out must now be swapped back in again. Just as there is no quick way
of mapping a struct page to every PTE that references it, there is no quick way
to map a swap entry to a PTE either. This requires that all process page tables
be traversed to find PTEs which reference the swap area to be deactivated and
swap them in. This of course means that swap deactivation will fail if the physical
memory is not available.
The function responsible for deactivating an area is, predictably enough, called
sys_swapoff(). This function is mainly concerned with updating the
swap_info_struct. The major task of paging in each paged-out page is the
responsibility of try_to_unuse() which is extremely expensive. For each slot used
in the swap_map, the page tables for processes have to be traversed searching for it.
In the worst case, all page tables belonging to all mm_structs may have to be
traversed. Therefore, the tasks taken for deactivating an area are broadly speaking:
• Call user_path_walk() to acquire the information about the special file to be
  deactivated and then take the BKL

• Remove the swap_info_struct from the swap list and update the global
  statistics on the number of swap pages available (nr_swap_pages) and the
  total number of swap entries (total_swap_pages). Once this is acquired, the
  BKL can be released again

• Call try_to_unuse() which will page in all pages from the swap area to be
  deactivated. This function loops through the swap map using
  find_next_to_unuse() to locate the next used swap slot. For each used slot
  it finds, it performs the following:

  – Call read_swap_cache_async() to allocate a page for the slot saved on
    disk. Ideally it exists in the swap cache already but the page allocator
    will be called if it is not

  – Wait on the page to be fully paged in and lock it. Once locked, call
    unuse_process() for every process that has a PTE referencing the page.
    This function traverses the page table searching for the relevant PTE
    and then updates it to point to the struct page. If the page is a shared
    memory page with no remaining reference, shmem_unuse() is called
    instead

  – Free all slots that were permanently mapped. It is believed that slots will
    never become permanently reserved so the risk is taken

  – Delete the page from the swap cache to prevent try_to_swap_out()
    referencing a page in the event it still somehow has a reference in the
    swap map
• If there was not enough available memory to page in all the entries, the swap
  area is reinserted back into the running system as it cannot be simply dropped.
  If it succeeded, the swap_info_struct is placed into an uninitialised state and
  the swap_map memory is freed with vfree()
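The user-space side of deactivation is the matching swapoff(2) call handled by sys_swapoff(). A minimal, hedged example follows; the path is again a made-up placeholder and, as described above, the call can fail with ENOMEM if the pages in the area cannot all be brought back into physical memory.

#include <stdio.h>
#include <sys/swap.h>   /* swapoff() */

int main(void)
{
        /* Hypothetical swap area path */
        if (swapoff("/dev/sdb2") < 0) {
                perror("swapoff");   /* e.g. ENOMEM if pages cannot be swapped in */
                return 1;
        }
        return 0;
}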
11.10 What's New in 2.6
The most important addition to the struct swap_info_struct is a linked list
called extent_list and a cache field called curr_swap_extent for the
implementation of extents.

Extents, which are represented by a struct swap_extent, map a contiguous
range of pages in the swap area onto a contiguous range of disk blocks. These
extents are set up at swapon time by the function setup_swap_extents(). For block
devices, there will only be one swap extent and it will not improve performance but
the extent is set up so that swap areas backed by block devices or regular files can
be treated the same.

It can make a large difference with swap files which will have multiple extents
representing ranges of pages clustered together in blocks. When searching for the page
at a particular offset, the extent list will be traversed. To improve search times, the
last extent that was searched will be cached in swap_info_struct→curr_swap_extent.
Chapter 12
Shared Memory Virtual Filesystem
Sharing a region of memory backed by a file or device is simply a case of calling
mmap() with the MAP_SHARED flag. However, there are two important cases
where an anonymous region needs to be shared between processes. The first is when
mmap() is called with MAP_SHARED but there is no file backing. These regions will be
shared between a parent and child process after a fork() is executed. The second is
when a region is explicitly set up with shmget() and attached to the virtual address
space with shmat().
When pages within a VMA are backed by a file on disk, the interface used
is straightforward. To read a page during a page fault, the required nopage()
function is found in vm_area_struct→vm_ops. To write a page to backing storage,
the appropriate writepage() function is found in the address_space_operations
via inode→i_mapping→a_ops or alternatively via page→mapping→a_ops. When
normal file operations are taking place such as mmap(), read() and write(), the
struct file_operations with the appropriate functions is found via inode→i_fop
and so on. These relationships were illustrated in Figure 4.2.
This is a very clean interface that is conceptually easy to understand but it
does not help anonymous pages as there is no file backing. To keep this nice
interface, Linux creates an artificial file backing for anonymous pages using a RAM-based
filesystem where each VMA is backed by a file in this filesystem. Every inode in
the filesystem is placed on a linked list called shmem_inodes so that they may
always be easily located. This allows the same file-based interface to be used without
treating anonymous pages as a special case.
The filesystem comes in two variations called shm and tmpfs. They both share
core functionality and mainly differ in what they are used for. shm is for use by the
kernel for creating file backings for anonymous pages and for backing regions created
by shmget(). This filesystem is mounted by kern_mount() so that it is mounted
internally and not visible to users. tmpfs is a temporary filesystem that may be
optionally mounted on /tmp/ to have a fast RAM-based temporary filesystem. A
secondary use for tmpfs is to mount it on /dev/shm/. Processes that mmap() files
in the tmpfs filesystem will be able to share information between them as an
alternative to System V IPC mechanisms. Regardless of the type of use, tmpfs must be
explicitly mounted by the system administrator.
This chapter begins with a description of how the virtual filesystem is implemented.
From there, we will discuss how shared regions are set up and destroyed before
talking about how the tools are used to implement System V IPC mechanisms.
12.1 Initialising the Virtual Filesystem
The virtual filesystem is initialised by the function init_tmpfs() during either
system start or when the module is being loaded. This function registers the two
filesystems, tmpfs and shm, and mounts shm as an internal filesystem with
kern_mount(). It then calculates the maximum number of blocks and inodes that
can exist in the filesystems. As part of the registration, the function
shmem_read_super() is used as a callback to populate a struct super_block with
more information about the filesystems such as making the block size equal to the
page size.
Figure 12.1: Call Graph: init_tmpfs()
Every inode created in the filesystem will have a struct shmem_inode_info
associated with it which contains private information specific to the filesystem. The
function SHMEM_I() takes an inode as a parameter and returns a pointer to a struct
of this type. It is declared as follows in <linux/shmem_fs.h>:
20 struct shmem_inode_info {
21         spinlock_t              lock;
22         unsigned long           next_index;
23         swp_entry_t             i_direct[SHMEM_NR_DIRECT];
24         void                  **i_indirect;
25         unsigned long           swapped;
26         unsigned long           flags;
27         struct list_head        list;
28         struct inode            *inode;
29 };
The fields are:
lock is a spinlock protecting the inode information from concurrent accesses
next_index is an index of the last page being used in the file. This will be
different from inode→i_size while a file is being truncated

i_direct is a direct block containing the first SHMEM_NR_DIRECT swap vectors in
use by the file. See Section 12.4.1

i_indirect is a pointer to the first indirect block. See Section 12.4.1

swapped is a count of the number of pages belonging to the file that are currently
swapped out

flags is currently only used to remember if the file belongs to a shared region set up
by shmget(). It is set by specifying SHM_LOCK with shmctl() and unlocked
by specifying SHM_UNLOCK

list is a list of all inodes used by the filesystem

inode is a pointer to the parent inode
12.2 Using shmem Functions
Different structs contain pointers for shmem specific functions. In all cases, tmpfs
and shm share the same structs.
For faulting in pages and writing them to backing storage, two structs called
shmem_aops and shmem_vm_ops of type struct address_space_operations and
struct vm_operations_struct respectively are declared.
The address space operations struct shmem_aops contains pointers to a small
number of functions of which the most important one is shmem_writepage()
which is called when a page is moved from the page cache to the swap cache.
shmem_removepage() is called when a page is removed from the page cache so
that the block can be reclaimed. shmem_readpage() is not used by tmpfs but
is provided so that the sendfile() system call may be used with tmpfs files.
shmem_prepare_write() and shmem_commit_write() are also unused, but are
provided so that tmpfs can be used with the loopback device. shmem_aops is declared
as follows in mm/shmem.c:
1500 static struct address_space_operations shmem_aops = {
1501         removepage:     shmem_removepage,
1502         writepage:      shmem_writepage,
1503 #ifdef CONFIG_TMPFS
1504         readpage:       shmem_readpage,
1505         prepare_write:  shmem_prepare_write,
1506         commit_write:   shmem_commit_write,
1507 #endif
1508 };
Anonymous VMAs use shmem_vm_ops as their vm_operations_struct so that
shmem_nopage() is called when a new page is being faulted in. It is declared as
follows:
1426 static struct vm_operations_struct shmem_vm_ops = {
1427         nopage:         shmem_nopage,
1428 };
To perform operations on files and inodes, two structs, file_operations and
inode_operations, are required. The file_operations, called
shmem_file_operations, provides functions which implement mmap(), read(),
write() and fsync(). It is declared as follows:
1510 static struct file_operations shmem_file_operations = {
1511         mmap:           shmem_mmap,
1512 #ifdef CONFIG_TMPFS
1513         read:           shmem_file_read,
1514         write:          shmem_file_write,
1515         fsync:          shmem_sync_file,
1516 #endif
1517 };
Three sets of inode_operations are provided. The first is shmem_inode_operations
which is used for file inodes. The second, called shmem_dir_inode_operations,
is for directories. The last pair, called shmem_symlink_inline_operations and
shmem_symlink_inode_operations, is for use with symbolic links.

The two file operations supported are truncate() and setattr() which are
stored in a struct inode_operations called shmem_inode_operations.
shmem_truncate() is used to truncate a file. shmem_notify_change() is called
when the file attributes change. This allows, among other things, a file to be grown
with truncate() and to use the global zero page as the data page.
shmem_inode_operations is declared as follows:
1519 static struct inode_operations shmem_inode_operations = {
1520         truncate:       shmem_truncate,
1521         setattr:        shmem_notify_change,
1522 };
The directory inode_operations provides functions such as create(), link()
and mkdir(). They are declared as follows:
1524 static struct inode_operations shmem_dir_inode_operations = {
1525 #ifdef CONFIG_TMPFS
1526         create:         shmem_create,
1527         lookup:         shmem_lookup,
1528         link:           shmem_link,
1529         unlink:         shmem_unlink,
1530         symlink:        shmem_symlink,
1531         mkdir:          shmem_mkdir,
1532         rmdir:          shmem_rmdir,
1533         mknod:          shmem_mknod,
1534         rename:         shmem_rename,
1535 #endif
1536 };
The last pair of operations are for use with symlinks. They are declared as:
1354 static struct inode_operations shmem_symlink_inline_operations = {
1355         readlink:       shmem_readlink_inline,
1356         follow_link:    shmem_follow_link_inline,
1357 };
1358 
1359 static struct inode_operations shmem_symlink_inode_operations = {
1360         truncate:       shmem_truncate,
1361         readlink:       shmem_readlink,
1362         follow_link:    shmem_follow_link,
1363 };
The difference between the two readlink() and follow_link() functions is
related to where the link information is stored. A symlink inode does not require the
private inode information struct shmem_inode_info. If the length of the
symbolic link name is smaller than this struct, the space in the inode is used to store
the name and shmem_symlink_inline_operations becomes the inode operations
struct. Otherwise a page is allocated with shmem_getpage(), the symbolic link is
copied to it and shmem_symlink_inode_operations is used. The second struct
includes a truncate() function so that the page will be reclaimed when the file is
deleted.
These various structs ensure that the shmem equivalent of inode-related operations
will be used when regions are backed by virtual files. When they are used, the
majority of the VM sees no difference between pages backed by a real file and ones
backed by virtual files.
12.3 Creating Files in tmpfs
As tmpfs is mounted as a proper filesystem that is visible to the user, it must support
directory inode operations such as open(), mkdir() and link(). Pointers to
functions which implement these for tmpfs are provided in
shmem_dir_inode_operations which was shown in Section 12.2.
The implementations of most of these functions are quite small and, at some
level, they are all interconnected as can be seen from Figure 12.2. All of them
share the same basic principle of performing some work with inodes in the virtual
filesystem and the majority of the inode fields are filled in by shmem_get_inode().

Figure 12.2: Call Graph: shmem_create()
When creating a new le, the top-level function called is
This small function calls
shmem_create().
shmem_mknod() with the S_IFREG ag added so that
shmem_mknod() is little more than a wrapper
a regular le will be created.
12.4 Page Faulting within a Virtual File
around the
shmem_get_inode()
in the struct elds.
188
which, predictably, creates a new inode and lls
The three elds of principal interest that are lled are the
inode→i_mapping→a_ops, inode→i_op and inode→i_fop elds. Once the inode has been created, shmem_mknod() updates the directory inode size and mtime
statistics before instantiating the new inode.
Files are created dierently in
shm
even though the lesystems are essentially
identical in functionality. How these les are created is covered later in Section 12.7.
12.4 Page Faulting within a Virtual File
When a page fault occurs, do_no_page() will call vma→vm_ops→nopage if it exists.
In the case of the virtual filesystem, this means the function shmem_nopage(), whose
call graph is shown in Figure 12.3, will be called when a page fault occurs.

Figure 12.3: Call Graph: shmem_nopage()

The core function in this case is shmem_getpage() which is responsible for either
allocating a new page or finding it in swap. This overloading of fault types is unusual
as do_swap_page() is normally responsible for locating pages that have been moved
to the swap cache or backing storage using information encoded within the PTE. In
this case, pages backed by virtual files have their PTE set to 0 when they are moved
to the swap cache. The inode's private filesystem data stores direct and indirect
block information which is used to locate the pages later. This operation is very
similar in many respects to normal page faulting.
12.4.1 Locating Swapped Pages
When a page has been swapped out, a swp_entry_t will contain information needed
to locate the page again. Instead of using the PTEs for this task, the information
is stored within the filesystem-specific private information in the inode.

When faulting, the function called to locate the swap entry is
shmem_alloc_entry(). Its basic task is to perform basic checks and ensure that
shmem_inode_info→next_index always points to the page index at the end of the
virtual file. Its principal task is to call shmem_swp_entry(), which searches for the
swap vector within the inode information and allocates new pages as necessary to
store swap vectors.

The first SHMEM_NR_DIRECT entries are stored in inode→i_direct. This means
that for the x86, files that are smaller than 64KiB (SHMEM_NR_DIRECT * PAGE_SIZE)
will not need to use indirect blocks. Larger files must use indirect blocks starting
with the one located at inode→i_indirect.
Figure 12.4: Traversing Indirect Blocks in a Virtual File
The initial indirect block (inode→i_indirect) is broken into two halves. The
first half contains pointers to doubly indirect blocks and the second half contains
pointers to triply indirect blocks. The doubly indirect blocks are pages containing
swap vectors (swp_entry_t). The triple indirect blocks contain pointers to pages
which in turn are filled with swap vectors. The relationship between the different
levels of indirect blocks is illustrated in Figure 12.4. The relationship means that
the maximum number of pages in a virtual file (SHMEM_MAX_INDEX) is defined as
follows in mm/shmem.c:

44 #define SHMEM_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
12.4.2 Writing Pages to Swap
The function shmem_writepage() is the registered function in the filesystem's
address_space_operations for writing pages to swap. The function is responsible
for simply moving the page from the page cache to the swap cache. This is
implemented with a few simple steps:
• Record the current page→mapping and information about the inode

• Allocate a free slot in the backing storage with get_swap_page()

• Allocate a swp_entry_t with shmem_swp_entry()

• Remove the page from the page cache

• Add the page to the swap cache. If it fails, free the swap slot, add back to the
  page cache and try again
12.5 File Operations in tmpfs
Four operations, mmap(), read(), write() and fsync(), are supported with virtual
files. Pointers to the functions are stored in shmem_file_operations which was
shown in Section 12.2.

There is little that is unusual in the implementation of these operations and
they are covered in detail in the Code Commentary. The mmap() operation is
implemented by shmem_mmap() and it simply updates the VMA that is managing the
mapped region. read(), implemented by shmem_read(), performs the operation
of copying bytes from the virtual file to a userspace buffer, faulting in pages as
necessary. write(), implemented by shmem_write(), is essentially the same. The
fsync() operation is implemented by shmem_file_sync() but is essentially a NULL
operation as it performs no task and simply returns 0 for success. As the files only
exist in RAM, they do not need to be synchronised with any disk.
12.6 Inode Operations in tmpfs
The most complex operation that is supported for inodes is truncation and it involves
four distinct stages. The first, in shmem_truncate(), will truncate a partial page
at the end of the file and continually calls shmem_truncate_indirect() until the
file is truncated to the proper size. Each call to shmem_truncate_indirect() will
only process one indirect block at each pass which is why it may need to be called
multiple times.

The second stage, in shmem_truncate_indirect(), understands both doubly
and triply indirect blocks. It finds the next indirect block that needs to be truncated.
This indirect block, which is passed to the third stage, will contain pointers to pages
which in turn contain swap vectors.

The third stage, in shmem_truncate_direct(), works with pages that contain
swap vectors. It selects a range that needs to be truncated and passes
the range to the last stage, shmem_swp_free(). The last stage frees entries with
free_swap_and_cache() which frees both the swap entry and the page containing
data.
The linking and unlinking of files is very simple as most of the work is performed
by the filesystem layer. To link a file, the directory inode size is incremented, the
ctime and mtime of the affected inodes are updated and the number of links to the
inode being linked to is incremented. A reference to the new dentry is then taken
with dget() before instantiating the new dentry with d_instantiate(). Unlinking
updates the same inode statistics before decrementing the reference to the dentry
with dput(). dput() will also call iput() which will clear up the inode when its
reference count hits zero.
Creating a directory will use shmem_mkdir() to perform the task. It simply
uses shmem_mknod() with the S_IFDIR flag before incrementing the parent directory
inode's i_nlink counter. The function shmem_rmdir() will delete a directory by first
ensuring it is empty with shmem_empty(). If it is, the function then decrements
the parent directory inode's i_nlink count and calls shmem_unlink() to remove
the requested directory.
12.7 Setting up Shared Regions
A shared region is backed by a file created in shm. There are two cases where a new
file will be created: during the setup of a shared region with shmget() and when an
anonymous region is set up with mmap() with the MAP_SHARED flag. Both functions
use the core function shmem_file_setup() to create a file.

Figure 12.5: Call Graph: shmem_zero_setup()

As the filesystem is internal, the names of the files created do not have
to be unique as the files are always located by inode, not name. Therefore,
shmem_zero_setup() always says to create a file called dev/zero which is how it
shows up in the file /proc/pid/maps. Files created by shmget() are called SYSVNN
where the NN is the key that is passed as a parameter to shmget().

The core function shmem_file_setup() simply creates a new dentry and inode,
fills in the relevant fields and instantiates them.
12.8 System V IPC
The full internals of the IPC implementation are beyond the scope of this book.
This section will focus just on the implementations of shmget() and shmat() and
how they are affected by the VM. The system call shmget() is implemented by
sys_shmget(). It performs basic checks to the parameters and sets up the IPC
related data structures. To create the segment, it calls newseg(). This is the
function that creates the file in shmfs with shmem_file_setup() as discussed in
the previous section.
Figure 12.6: Call Graph: sys_shmget()

The system call shmat() is implemented by sys_shmat(). There is little
remarkable about the function. It acquires the appropriate descriptor and makes sure
all the parameters are valid before calling do_mmap() to map the shared region into
the process address space. There are only two points of note in the function.

The first is that it is responsible for ensuring that VMAs will not overlap if the
caller specifies the address. The second is that the shp→shm_nattch counter is
maintained by a vm_operations_struct called shm_vm_ops. It registers open()
and close() callbacks called shm_open() and shm_close() respectively. The
shm_close() callback is also responsible for destroying shared regions if the SHM_DEST
flag is specified and the shm_nattch counter reaches zero.
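From user space, the whole path reduces to the familiar System V calls: shmget() creates the segment (reaching shmem_file_setup() through newseg()) and shmat() maps it into the address space through do_mmap(). A minimal usage example, with an arbitrary key chosen purely for illustration:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
        /* Create (or find) a 4KiB shared segment with an arbitrary key */
        int shmid = shmget(0x1234, 4096, IPC_CREAT | 0600);
        char *addr;

        if (shmid < 0) {
                perror("shmget");
                return 1;
        }

        /* Attach it; the kernel maps the backing shm file into our address space */
        addr = shmat(shmid, NULL, 0);
        if (addr == (void *)-1) {
                perror("shmat");
                return 1;
        }

        strcpy(addr, "hello from the shared region");
        printf("%s\n", addr);

        shmdt(addr);                            /* detach */
        shmctl(shmid, IPC_RMID, NULL);          /* mark for destruction */
        return 0;
}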
12.9 What's New in 2.6
The core concept and functionality of the filesystem remains the same and the
changes are either optimisations or extensions to the filesystem's functionality. If
the reader understands the 2.4 implementation well, the 2.6 implementation will not
present much trouble¹.

A new field has been added to the shmem_inode_info called alloced. The
alloced field stores how many data pages are allocated to the file, which had to be
calculated on the fly in 2.4 based on inode→i_blocks. It both saves a few clock
cycles on a common operation as well as making the code a bit more readable.

1 I find that saying How hard could it possibly be always helps.
The flags field now uses the VM_ACCOUNT flag as well as the VM_LOCKED flag.
VM_ACCOUNT, which is always set, means that the VM will carefully account for the
amount of memory used to make sure that allocations will not fail.
Extensions to the file operations are the ability to seek with the system call
_llseek(), implemented by generic_file_llseek(), and the ability to use
sendfile() with virtual files, implemented by shmem_file_sendfile(). An
extension has been added to the VMA operations to allow non-linear mappings,
implemented by shmem_populate().

The last major change is that the filesystem is responsible for the allocation and
destruction of its own inodes, which are two new callbacks in struct super_operations.
It is simply implemented by the creation of a slab cache called shmem_inode_cache.
A constructor function init_once() is registered for the slab allocator to use for
initialising each new inode.
Chapter 13
Out Of Memory Management
The last aspect of the VM we are going to discuss is the Out Of Memory (OOM)
manager. This is intentionally a very short chapter as it has one simple task: check
if there is enough available memory to satisfy a request, verify that the system is truly
out of memory and, if so, select a process to kill. This is a controversial part of the
VM and it has been suggested that it be removed on many occasions. Regardless of
whether it exists in the latest kernel, it still is a useful system to examine as it touches
off a number of other subsystems.
13.1 Checking Available Memory
For certain operations, such as expanding the heap with brk() or remapping an
address space with mremap(), the system will check if there is enough available
memory to satisfy a request. Note that this is separate to the out_of_memory()
path that is covered in the next section. This path is used to avoid the system being
in a state of OOM if at all possible.

When checking available memory, the number of required pages is passed as a
parameter to vm_enough_memory(). Unless the system administrator has specified
that the system should overcommit memory, the amount of available memory will be
checked. To determine how many pages are potentially available, Linux sums up
the following bits of data:
Total page cache as page cache is easily reclaimed

Total free pages because they are already available

Total free swap pages as userspace pages may be paged out

Total pages managed by swapper_space although this double-counts the free
swap pages. This is balanced by the fact that slots are sometimes reserved but
not used

Total pages used by the dentry cache as they are easily reclaimed
Total pages used by the inode cache as they are easily reclaimed
If the total number of pages added here is sufficient for the request, vm_enough_memory()
returns true to the caller. If false is returned, the caller knows that the memory is
not available and usually decides to return -ENOMEM to userspace.
13.2 Determining OOM Status
When the machine is low on memory, old page frames will be reclaimed (see
Chapter 10) but, despite reclaiming pages, it may find that it was unable to free
enough pages to satisfy a request even when scanning at highest priority. If it does
fail to free page frames, out_of_memory() is called to see if the system is out of
memory and needs to kill a process.

Figure 13.1: Call Graph: out_of_memory()

Unfortunately, it is possible that the system is not out of memory and simply needs
to wait for IO to complete or for pages to be swapped to backing storage. This is
unfortunate, not because the system has memory, but because the function is being
called unnecessarily, opening the possibility of processes being unnecessarily killed.
Before deciding to kill a process, it goes through the following checklist.
• Is there enough swap space left (nr_swap_pages > 0)? If yes, not OOM

• Has it been more than 5 seconds since the last failure? If yes, not OOM

• Have we failed within the last second? If no, not OOM

• If there have not been at least 10 failures in the last 5 seconds, we are not OOM

• Has a process been killed within the last 5 seconds? If yes, not OOM
It is only if the above tests are passed that oom_kill() is called to select a
process to kill.
13.3 Selecting a Process
The function select_bad_process() is responsible for choosing a process to kill.
It decides by stepping through each running task and calculating how suitable it is
for killing with the function badness(). The badness is calculated as follows; note
that the square roots are integer approximations calculated with int_sqrt():

    badness_for_task = total_vm_for_task / (√(cpu_time_in_seconds) × ∜(cpu_time_in_minutes))

This has been chosen to select a process that is using a large amount of memory
but is not that long lived. Processes which have been running a long time are
unlikely to be the cause of memory shortage so this calculation is likely to select a
process that uses a lot of memory but has not been running long. If the process
is a root process or has CAP_SYS_ADMIN capabilities, the points are divided by four
as it is assumed that root privileged processes are well behaved. Similarly, if it has
CAP_SYS_RAWIO capabilities (access to raw devices), the points are further
divided by 4 as it is undesirable to kill a process that has direct access to hardware.
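The calculation is easy to reproduce. The program below is an illustrative re-implementation of the scoring described above, not the kernel's badness() itself; it uses a simple integer square root in the same spirit as int_sqrt() and applies the divisions for CAP_SYS_ADMIN and CAP_SYS_RAWIO as described.

#include <stdio.h>

/* Integer square root, in the spirit of the kernel's int_sqrt() */
static unsigned long isqrt(unsigned long x)
{
        unsigned long r = 0;

        while ((r + 1) * (r + 1) <= x)
                r++;
        return r;
}

/* Illustrative version of the badness calculation described above */
static unsigned long badness(unsigned long total_vm,
                             unsigned long cpu_seconds,
                             int is_admin, int has_rawio)
{
        unsigned long points = total_vm;
        unsigned long s = isqrt(cpu_seconds);
        unsigned long m = isqrt(isqrt(cpu_seconds / 60));   /* fourth root */

        if (s)
                points /= s;
        if (m)
                points /= m;
        if (is_admin)
                points /= 4;    /* root/CAP_SYS_ADMIN processes are trusted */
        if (has_rawio)
                points /= 4;    /* direct hardware access, avoid killing */
        return points;
}

int main(void)
{
        /* A big, young process scores higher than a big, long-lived one */
        printf("young memory hog: %lu\n", badness(100000, 10, 0, 0));
        printf("old memory hog:   %lu\n", badness(100000, 36000, 0, 0));
        return 0;
}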
13.4 Killing the Selected Process
Once a task is selected, the list is walked again and each process that shares the
same mm_struct as the selected process (i.e. they are threads) is sent a signal. If
the process has CAP_SYS_RAWIO capabilities, a SIGTERM is sent to give the process a
chance of exiting cleanly, otherwise a SIGKILL is sent.
13.5 Is That It?
Yes, that's it. Out of memory management touches a lot of subsystems; otherwise,
there is not much to it.
13.6 What's New in 2.6
The majority of OOM management remains essentially the same for 2.6 except for
the introduction of VM accounted objects. These are VMAs that are flagged with
the VM_ACCOUNT flag, first mentioned in Section 4.8. Additional checks will be made
to ensure there is memory available when performing operations on VMAs with this
flag set. The principal incentive for this complexity is to avoid the need of an OOM
killer.
Some regions which always have the VM_ACCOUNT flag set are the process stack,
the process heap, regions mmap()ed with MAP_SHARED, private regions that are
writable and regions set up with shmget(). In other words, most userspace mappings
have the VM_ACCOUNT flag set.
Linux accounts for the amount of memory that is committed to these VMAs with
vm_acct_memory() which increments a variable called committed_space. When the
VMA is freed, the committed space is decremented with vm_unacct_memory(). This
is a fairly simple mechanism, but it allows Linux to remember how much memory
it has already committed to userspace when deciding if it should commit more.
The checks are performed by calling security_vm_enough_memory() which
introduces us to another new feature. 2.6 has a feature available which allows
security related kernel modules to override certain kernel functions. The full list of
hooks available is stored in a struct security_operations called security_ops.
There are a number of dummy, or default, functions that may be used which are
all listed in security/dummy.c but the majority do nothing except return. If there
are no security modules loaded, the security_operations struct used is called
dummy_security_ops which uses all the default functions.

By default, security_vm_enough_memory() calls dummy_vm_enough_memory()
which is declared in security/dummy.c and is very similar to 2.4's vm_enough_memory()
function. The new version adds the following pieces of information together to
determine available memory:
Total page cache as page cache is easily reclaimed

Total free pages because they are already available

Total free swap pages as userspace pages may be paged out

Slab pages with SLAB_RECLAIM_ACCOUNT set as they are easily reclaimed
These pages, minus a 3% reserve for root processes, are the total amount of
memory that is available for the request. If the memory is available, it makes a
check to ensure the total amount of committed memory does not exceed the allowed
threshold. The allowed threshold is TotalRam * (OverCommitRatio/100) +
TotalSwapPage, where OverCommitRatio is set by the system administrator. If the
total amount of committed space is not too high, 1 will be returned so that the
allocation can proceed.
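The threshold arithmetic is straightforward. The worked example below uses assumed values, not figures read from a real system: 128MiB of RAM as 32768 pages, 64MiB of swap as 16384 pages and an overcommit ratio of 50, which is the usual default for vm.overcommit_ratio.

#include <stdio.h>

int main(void)
{
        /* Assumed example values, not read from a real system */
        unsigned long total_ram_pages  = 32768;   /* 128MiB with 4KiB pages */
        unsigned long total_swap_pages = 16384;   /* 64MiB of swap */
        unsigned long overcommit_ratio = 50;      /* assumed vm.overcommit_ratio */
        unsigned long committed_space  = 30000;   /* pages already committed */
        unsigned long request          = 2000;    /* pages being asked for */

        /* allowed = TotalRam * (OverCommitRatio/100) + TotalSwapPages */
        unsigned long allowed = total_ram_pages * overcommit_ratio / 100
                                + total_swap_pages;

        printf("allowed commit limit: %lu pages\n", allowed);
        if (committed_space + request <= allowed)
                printf("request for %lu pages would be allowed\n", request);
        else
                printf("request for %lu pages would be refused\n", request);
        return 0;
}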
Chapter 14
The Final Word
Make no mistake, memory management is a large, complex and time consuming field
to research and difficult to apply to practical implementations. As it is very difficult
to model how systems behave in real multi-programmed systems [CD80], developers
often rely on intuition to guide them and examination of virtual memory algorithms
depends on simulations of specific workloads. Simulations are necessary as modeling
how scheduling, paging behaviour and multiple processes interact presents a
considerable challenge. Page replacement policies, a field that has been the focus
of considerable amounts of research, is a good example as it is only ever shown to
work well for specified workloads. The problem of adjusting algorithms and policies
to different workloads is addressed by having administrators tune systems as much
as by research and algorithms.
The Linux kernel is also large, complex and fully understood by a relatively small
core group of people. Its development is the result of contributions of thousands
of programmers with a varying range of specialties, backgrounds and spare time.
The first implementations are developed based on the all-important foundation that
theory provides. Contributors built upon this framework with changes based on real
world observations.
It has been asserted on the Linux Memory Management mailing list that the VM
is poorly documented and difficult to pick up as the implementation is a nightmare
to follow¹ and the lack of documentation on practical VMs is not just confined
to Linux. Matt Dillon, one of the principal developers of the FreeBSD VM² and
considered a VM Guru, stated in an interview³ that documentation can be hard
to come by. One of the principal difficulties with deciphering the implementation
is the fact that the developer must have a background in memory management theory
to see why implementation decisions were made as a pure understanding of the code
is insufficient for any purpose other than micro-optimisations.

This book attempted to bridge the gap between memory management theory
and the practical implementation in Linux and tie both fields together in a single
place. It tried to describe what life is like in Linux as a memory manager in a
manner that was relatively independent of hardware architecture considerations. I
hope after reading this, and progressing onto the code commentary, that you, the
reader, feel a lot more comfortable with tackling the VM subsystem.

1 http://mail.nl.linux.org/linux-mm/2002-05/msg00035.html
2 His past involvement with the Linux VM is evident from http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
3 http://kerneltrap.com/node.php?id=8
As a final parting shot, Figure 14.1 broadly illustrates how the sub-systems we
discussed in detail interact with each other.
On a final personal note, I hope that this book encourages other people to produce similar works for other areas of the kernel. I know I'll buy them!
Figure 14.1: Broad Overview on how VM Sub-Systems Interact
Appendix A
Introduction
Welcome to the code commentary section of the book. If you are reading this, you
are looking for a heavily detailed tour of the code. The commentary presumes you
have read the equivalent section in the main part of the book, so if you just started
reading here, you're probably in the wrong place.

Each appendix section corresponds to the order and structure of the book. The
order the functions are presented in is the same order as displayed in the call graphs
which are referenced throughout the commentary. At the beginning of each appendix
and subsection, there is a mini table of contents to help navigate your way through
the commentary. The code coverage is not 100% but all the principal code patterns
that are found throughout the VM may be found. If the function you are interested
in is not commented on, try and find a similar function to it.

Some of the code has been reformatted slightly for presentation but the actual
code is not changed. It is recommended you use the companion CD while reading
the code commentary. In particular, use LXR to browse through the source code so
you get a feel for reading the code with and without the aid of the commentary.

Good Luck!
Appendix B
Describing Physical Memory
Contents
B.1 Initialising Zones . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
    B.1.1 Function: setup_memory() . . . . . . . . . . . . . . . . . . . . . 202
    B.1.2 Function: zone_sizes_init() . . . . . . . . . . . . . . . . . . . 205
    B.1.3 Function: free_area_init() . . . . . . . . . . . . . . . . . . . . 206
    B.1.4 Function: free_area_init_node() . . . . . . . . . . . . . . . . . 206
    B.1.5 Function: free_area_init_core() . . . . . . . . . . . . . . . . . 208
    B.1.6 Function: build_zonelists() . . . . . . . . . . . . . . . . . . . 214
B.2 Page Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
    B.2.1 Locking Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
        B.2.1.1 Function: lock_page() . . . . . . . . . . . . . . . . . 216
        B.2.1.2 Function: __lock_page() . . . . . . . . . . . . . . . . 216
        B.2.1.3 Function: sync_page() . . . . . . . . . . . . . . . . . 217
    B.2.2 Unlocking Pages . . . . . . . . . . . . . . . . . . . . . . . . . . 218
        B.2.2.1 Function: unlock_page() . . . . . . . . . . . . . . . . 218
    B.2.3 Waiting on Pages . . . . . . . . . . . . . . . . . . . . . . . . . . 219
        B.2.3.1 Function: wait_on_page() . . . . . . . . . . . . . . . 219
        B.2.3.2 Function: ___wait_on_page() . . . . . . . . . . . . . 219
B.1 Initialising Zones
Contents
    B.1 Initialising Zones                              202
        B.1.1 Function: setup_memory()                  202
        B.1.2 Function: zone_sizes_init()               205
        B.1.3 Function: free_area_init()                206
        B.1.4 Function: free_area_init_node()           206
        B.1.5 Function: free_area_init_core()           208
        B.1.6 Function: build_zonelists()               214

B.1.1 Function: setup_memory() (arch/i386/kernel/setup.c)
The call graph for this function is shown in Figure 2.3. This function gets the
necessary information to give to the boot memory allocator to initialise itself. It is
broken up into a number of different tasks.

• Find the start and ending PFN for low memory (min_low_pfn, max_low_pfn),
  the start and end PFN for high memory (highstart_pfn, highend_pfn) and
  the PFN for the last page in the system (max_pfn).

• Initialise the bootmem_data structure and declare which pages may be used
  by the boot memory allocator

• Mark all pages usable by the system as free and then reserve the pages used
  by the bitmap representing the pages

• Reserve pages used by the SMP config or the initrd image if one exists
 991 static unsigned long __init setup_memory(void)
 992 {
 993         unsigned long bootmap_size, start_pfn, max_low_pfn;
 994 
 995         /*
 996          * partially used pages are not usable - thus
 997          * we are rounding upwards:
 998          */
 999         start_pfn = PFN_UP(__pa(&_end));
1000 
1001         find_max_pfn();
1002 
1003         max_low_pfn = find_max_low_pfn();
1004 
1005 #ifdef CONFIG_HIGHMEM
1006         highstart_pfn = highend_pfn = max_pfn;
1007         if (max_pfn > max_low_pfn) {
1008                 highstart_pfn = max_low_pfn;
1009         }
1010         printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
1011                 pages_to_mb(highend_pfn - highstart_pfn));
1012 #endif
1013         printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
1014                 pages_to_mb(max_low_pfn));
999 PFN_UP() takes a physical address, rounds it up to the next page and returns
the page frame number. _end is the address of the end of the loaded kernel
image so start_pfn is now the offset of the first physical page frame that may
be used

1001 find_max_pfn() loops through the e820 map searching for the highest available pfn

1003 find_max_low_pfn() finds the highest page frame addressable in ZONE_NORMAL

1005-1011 If high memory is enabled, start with a high memory region of 0. If it
turns out there is memory after max_low_pfn, put the start of high memory
(highstart_pfn) there and the end of high memory at max_pfn. Print out an
informational message on the availability of high memory

1013-1014 Print out an informational message on the amount of low memory
1018         bootmap_size = init_bootmem(start_pfn, max_low_pfn);
1019 
1020         register_bootmem_low_pages(max_low_pfn);
1021 
1028         reserve_bootmem(HIGH_MEMORY, (PFN_PHYS(start_pfn) +
1029                 bootmap_size + PAGE_SIZE-1) - (HIGH_MEMORY));
1030 
1035         reserve_bootmem(0, PAGE_SIZE);
1036 
1037 #ifdef CONFIG_SMP
1043         reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
1044 #endif
1045 #ifdef CONFIG_ACPI_SLEEP
1046         /*
1047          * Reserve low memory region for sleep support.
1048          */
1049         acpi_reserve_bootmem();
1050 #endif
1018 init_bootmem() (See Section E.1.1) initialises the bootmem_data struct for
the contig_page_data node. It sets where physical memory begins and ends
for the node, allocates a bitmap representing the pages and sets all pages as
reserved initially

1020 register_bootmem_low_pages() reads the e820 map and calls free_bootmem()
(See Section E.3.1) for all usable pages in the running system. This is what
marks the pages marked as reserved during initialisation as free

1028-1029 Reserve the pages that are being used to store the bitmap representing
the pages

1035 Reserve page 0 as it is often a special page used by the BIOS

1043 Reserve an extra page which is required by the trampoline code. The trampoline
code deals with how userspace enters kernel space

1045-1050 If sleep support is added, reserve memory required for it. This is only
of interest to laptops interested in suspending and beyond the scope of this
book
1051 #ifdef CONFIG_X86_LOCAL_APIC
1052         /*
1053          * Find and reserve possible boot-time SMP configuration:
1054          */
1055         find_smp_config();
1056 #endif
1057 #ifdef CONFIG_BLK_DEV_INITRD
1058         if (LOADER_TYPE && INITRD_START) {
1059                 if (INITRD_START + INITRD_SIZE <=
                                (max_low_pfn << PAGE_SHIFT)) {
1060                         reserve_bootmem(INITRD_START, INITRD_SIZE);
1061                         initrd_start =
1062                                 INITRD_START ? INITRD_START + PAGE_OFFSET : 0;
1063                         initrd_end = initrd_start+INITRD_SIZE;
1064                 }
1065                 else {
1066                         printk(KERN_ERR
                                "initrd extends beyond end of memory "
1067                                "(0x%08lx > 0x%08lx)\ndisabling initrd\n",
1068                                INITRD_START + INITRD_SIZE,
1069                                max_low_pfn << PAGE_SHIFT);
1070                         initrd_start = 0;
1071                 }
1072         }
1073 #endif
1074 
1075         return max_low_pfn;
1076 }
1055 This function reserves memory that stores config information about the SMP
setup

1057-1073 If initrd is enabled, the memory containing its image will be reserved.
initrd provides a tiny filesystem image which is used to boot the system

1075 Return the upper limit of addressable memory in ZONE_NORMAL
B.1.2 Function: zone_sizes_init() (arch/i386/mm/init.c)

This is the top-level function which is used to initialise each of the zones. The size
of the zones in PFNs was discovered during setup_memory() (See Section B.1.1).
This function populates an array of zone sizes for passing to free_area_init().
323 static void __init zone_sizes_init(void)
324 {
325         unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
326         unsigned int max_dma, high, low;
327 
328         max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
329         low = max_low_pfn;
330         high = highend_pfn;
331 
332         if (low < max_dma)
333                 zones_size[ZONE_DMA] = low;
334         else {
335                 zones_size[ZONE_DMA] = max_dma;
336                 zones_size[ZONE_NORMAL] = low - max_dma;
337 #ifdef CONFIG_HIGHMEM
338                 zones_size[ZONE_HIGHMEM] = high - low;
339 #endif
340         }
341         free_area_init(zones_size);
342 }
325 Initialise the sizes to 0

328 Calculate the PFN for the maximum possible DMA address. This doubles up
as the largest number of pages that may exist in ZONE_DMA

329 max_low_pfn is the highest PFN available to ZONE_NORMAL

330 highend_pfn is the highest PFN available to ZONE_HIGHMEM

332-333 If the highest PFN in ZONE_NORMAL is below MAX_DMA_ADDRESS, then just
set the size of ZONE_DMA to it. The other zones remain at 0

335 Set the number of pages in ZONE_DMA

336 The size of ZONE_NORMAL is max_low_pfn minus the number of pages in
ZONE_DMA

338 The size of ZONE_HIGHMEM is the highest possible PFN minus the highest
possible PFN in ZONE_NORMAL (max_low_pfn)
B.1.3 Function: free_area_init() (mm/page_alloc.c)

This is the architecture-independent function for setting up a UMA architecture.
It simply calls the core function passing the static contig_page_data as the node.
NUMA architectures will use free_area_init_node() instead.
838 void __init free_area_init(unsigned long *zones_size)
839 {
840         free_area_init_core(0, &contig_page_data, &mem_map, zones_size,
                        0, 0, 0);
841 }
838 The parameters passed to free_area_init_core() are

    0 is the Node Identifier for the node, which is 0

    contig_page_data is the static global pg_data_t

    mem_map is the global mem_map used for tracking struct pages. The
    function free_area_init_core() will allocate memory for this array

    zones_sizes is the array of zone sizes filled by zone_sizes_init()

    0 This zero is the starting physical address

    0 The second zero is an array of memory hole sizes which doesn't apply to
    UMA architectures

    0 The last 0 is a pointer to a local mem_map for this node which is used by
    NUMA architectures
B.1.4 Function: free_area_init_node() (mm/numa.c)

There are two versions of this function. The first is almost identical to
free_area_init() except it uses a different starting physical address. This is for
architectures that have only one node (so they use contig_page_data) but whose
physical address is not at 0.

This version of the function, called after the pagetable initialisation, is for
initialising each pgdat in the system. The caller has the option of allocating their
own local portion of the mem_map and passing it in as a parameter if they want to
optimise its location for the architecture. If they choose not to, it will be allocated
later by free_area_init_core().
 61 void __init free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
 62         unsigned long *zones_size, unsigned long zone_start_paddr,
 63         unsigned long *zholes_size)
 64 {
 65         int i, size = 0;
 66         struct page *discard;
 67 
 68         if (mem_map == (mem_map_t *)NULL)
 69                 mem_map = (mem_map_t *)PAGE_OFFSET;
 70 
 71         free_area_init_core(nid, pgdat, &discard, zones_size,
 72                              zone_start_paddr, zholes_size, pmap);
 73         pgdat->node_id = nid;
 74 
 75         /*
 76          * Get space for the valid bitmap.
 77          */
 78         for (i = 0; i < MAX_NR_ZONES; i++)
 79                 size += zones_size[i];
 80         size = LONG_ALIGN((size + 7) >> 3);
 81         pgdat->valid_addr_bitmap =
                        (unsigned long *)alloc_bootmem_node(pgdat, size);
 82         memset(pgdat->valid_addr_bitmap, 0, size);
 83 }
61 The parameters to the function are:

    nid is the Node Identifier (NID) of the pgdat passed in

    pgdat is the node to be initialised

    pmap is a pointer to the portion of the mem_map for this node to use,
    frequently passed as NULL and allocated later

    zones_size is an array of zone sizes in this node

    zone_start_paddr is the starting physical address for the node

    zholes_size is an array of hole sizes in each zone

68-69 If the global mem_map has not been set, set it to the beginning of the kernel
portion of the linear address space. Remember that with NUMA, mem_map is a
virtual array with portions filled in by local maps used by each node

71 Call free_area_init_core(). Note that discard is passed in as the third
parameter as no global mem_map needs to be set for NUMA

73 Record the pgdat's NID

78-79 Calculate the total size of the node

80 Recalculate size as the number of bits required to have one bit for every byte of
the size

81 Allocate a bitmap to represent where valid areas exist in the node. In reality,
this is only used by the sparc architecture so it is unfortunate to waste the
memory for every other architecture

82 Initially, all areas are invalid. Valid regions are marked later in the mem_init()
functions for the sparc. Other architectures just ignore the bitmap
B.1.5 Function: free_area_init_core() (mm/page_alloc.c)

This function is responsible for initialising all zones and allocating their local
lmem_map within a node. In UMA architectures, this function is called in a way that
will initialise the global mem_map array. In NUMA architectures, the array is treated
as a virtual array that is sparsely populated.
684 void __init free_area_init_core(int nid,
                pg_data_t *pgdat, struct page **gmap,
685         unsigned long *zones_size, unsigned long zone_start_paddr,
686         unsigned long *zholes_size, struct page *lmem_map)
687 {
688         unsigned long i, j;
689         unsigned long map_size;
690         unsigned long totalpages, offset, realtotalpages;
691         const unsigned long zone_required_alignment =
                                        1UL << (MAX_ORDER-1);
692 
693         if (zone_start_paddr & ~PAGE_MASK)
694                 BUG();
695 
696         totalpages = 0;
697         for (i = 0; i < MAX_NR_ZONES; i++) {
698                 unsigned long size = zones_size[i];
699                 totalpages += size;
700         }
701         realtotalpages = totalpages;
702         if (zholes_size)
703                 for (i = 0; i < MAX_NR_ZONES; i++)
704                         realtotalpages -= zholes_size[i];
705 
706         printk("On node %d totalpages: %lu\n", nid, realtotalpages);
This block is mainly responsible for calculating the size of each zone.

691 The zone must be aligned against the maximum sized block that can be
allocated by the buddy allocator for bitwise operations to work

693-694 It is a bug if the physical address is not page aligned

696 Initialise the totalpages count for this node to 0

697-700 Calculate the total size of the node by iterating through zone_sizes

701-704 Calculate the real amount of memory by subtracting the size of the holes
in zholes_size

706 Print an informational message for the user on how much memory is available
in this node
708         /*
709          * Some architectures (with lots of mem and discontinous memory
710          * maps) have to search for a good mem_map area:
711          * For discontigmem, the conceptual mem map array starts from
712          * PAGE_OFFSET, we need to align the actual array onto a mem map
713          * boundary, so that MAP_NR works.
714          */
715         map_size = (totalpages + 1)*sizeof(struct page);
716         if (lmem_map == (struct page *)0) {
717                 lmem_map = (struct page *) alloc_bootmem_node(pgdat, map_size);
718                 lmem_map = (struct page *)(PAGE_OFFSET +
719                         MAP_ALIGN((unsigned long)lmem_map - PAGE_OFFSET));
720         }
721         *gmap = pgdat->node_mem_map = lmem_map;
722         pgdat->node_size = totalpages;
723         pgdat->node_start_paddr = zone_start_paddr;
724         pgdat->node_start_mapnr = (lmem_map - mem_map);
725         pgdat->nr_zones = 0;
726 
727         offset = lmem_map - mem_map;
This block allocates the local lmem_map if necessary and sets the gmap. In UMA
architectures, gmap is actually mem_map and so this is where the memory for it is
allocated

715 Calculate the amount of memory required for the array. It is the total number
of pages multiplied by the size of a struct page

716 If the map has not already been allocated, allocate it

717 Allocate the memory from the boot memory allocator

718 MAP_ALIGN() will align the array on a struct page sized boundary for
calculations that locate offsets within the mem_map based on the physical address
with the MAP_NR() macro

721 Set the gmap and pgdat→node_mem_map variables to the allocated lmem_map.
In UMA architectures, this just sets mem_map

722 Record the size of the node

723 Record the starting physical address

724 Record what the offset within mem_map this node occupies

725 Initialise the zone count to 0. This will be set later in the function

727 offset is now the offset within mem_map that the local portion lmem_map
begins at
728         for (j = 0; j < MAX_NR_ZONES; j++) {
729                 zone_t *zone = pgdat->node_zones + j;
730                 unsigned long mask;
731                 unsigned long size, realsize;
732 
733                 zone_table[nid * MAX_NR_ZONES + j] = zone;
734                 realsize = size = zones_size[j];
735                 if (zholes_size)
736                         realsize -= zholes_size[j];
737 
738                 printk("zone(%lu): %lu pages.\n", j, size);
739                 zone->size = size;
740                 zone->name = zone_names[j];
741                 zone->lock = SPIN_LOCK_UNLOCKED;
742                 zone->zone_pgdat = pgdat;
743                 zone->free_pages = 0;
744                 zone->need_balance = 0;
745                 if (!size)
746                         continue;
This block starts a loop which initialises every zone_t within the node. The
initialisation starts with the setting of the simpler fields that values already exist
for.

728 Loop through all zones in the node

733 Record a pointer to this zone in the zone_table. See Section 2.4.1

734-736 Calculate the real size of the zone based on the full size in zones_size
minus the size of the holes in zholes_size

738 Print an informational message saying how many pages are in this zone

739 Record the size of the zone

740 zone_names is the string name of the zone for printing purposes

741-744 Initialise some other fields for the zone such as its parent pgdat

745-746 If the zone has no memory, continue to the next zone as nothing further
is required
752                 zone->wait_table_size = wait_table_size(size);
753                 zone->wait_table_shift =
754                         BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
755                 zone->wait_table = (wait_queue_head_t *)
756                         alloc_bootmem_node(pgdat, zone->wait_table_size
757                                 * sizeof(wait_queue_head_t));
758 
759                 for(i = 0; i < zone->wait_table_size; ++i)
760                         init_waitqueue_head(zone->wait_table + i);
Initialise the waitqueue for this zone. Processes waiting on pages in the zone use
this hashed table to select a queue to wait on. This means that all processes waiting
in a zone will not have to be woken when a page is unlocked, just a smaller subset.
752 wait_table_size() calculates the size of the table to use based on the number
of pages in the zone and the desired ratio between the number of queues and
the number of pages. The table will never be larger than 4KiB
753-754 Calculate the shift for the hashing algorithm
755 Allocate a table of wait_queue_head_t that can hold zone→wait_table_size
entries
759-760 Initialise all of the wait queues
762                 pgdat->nr_zones = j+1;
763 
764                 mask = (realsize / zone_balance_ratio[j]);
765                 if (mask < zone_balance_min[j])
766                         mask = zone_balance_min[j];
767                 else if (mask > zone_balance_max[j])
768                         mask = zone_balance_max[j];
769                 zone->pages_min = mask;
770                 zone->pages_low = mask*2;
771                 zone->pages_high = mask*3;
772 
773                 zone->zone_mem_map = mem_map + offset;
774                 zone->zone_start_mapnr = offset;
775                 zone->zone_start_paddr = zone_start_paddr;
776 
777                 if ((zone_start_paddr >> PAGE_SHIFT) &
778                                 (zone_required_alignment-1))
779                         printk("BUG: wrong zone alignment, it will crash\n");
B.1 Initialising Zones (free_area_init_core())
212
Calculate the watermarks for the zone and record the location of the zone. The watermarks are calculated as ratios of the zone size.

762 First, as a new zone is active, update the number of zones in this node

764 Calculate the mask (which will be used as the pages_min watermark) as the size of the zone divided by the balance ratio for this zone. The balance ratio is 128 for all zones as declared at the top of mm/page_alloc.c

765-766 The zone_balance_min ratios are 20 for all zones so this means that pages_min will never be below 20

767-768 Similarly, the zone_balance_max ratios are all 255 so pages_min will never be over 255

769 pages_min is set to mask

770 pages_low is twice the number of pages as pages_min

771 pages_high is three times the number of pages as pages_min

773 Record where the first struct page for this zone is located within mem_map

774 Record the index within mem_map this zone begins at

775 Record the starting physical address

777-778 Ensure that the zone is correctly aligned for use with the buddy allocator, otherwise the bitwise operations used for the buddy allocator will break
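As a concrete illustration of the calculation just described, the small standalone program below (not kernel code) applies the default zone_balance_ratio of 128 and the min/max clamps of 20 and 255; the zone size of 16384 pages is an assumption chosen only for the example.

#include <stdio.h>

int main(void)
{
        unsigned long realsize = 16384;         /* example zone size in pages */
        unsigned long mask = realsize / 128;    /* zone_balance_ratio[j]      */

        if (mask < 20)                          /* zone_balance_min[j]        */
                mask = 20;
        else if (mask > 255)                    /* zone_balance_max[j]        */
                mask = 255;

        printf("pages_min=%lu pages_low=%lu pages_high=%lu\n",
               mask, mask * 2, mask * 3);       /* prints 128, 256 and 384    */
        return 0;
}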
780         /*
781          * Initially all pages are reserved - free ones are freed
782          * up by free_all_bootmem() once the early boot process is
783          * done. Non-atomic initialization, single-pass.
784          */
785         for (i = 0; i < size; i++) {
786             struct page *page = mem_map + offset + i;
787             set_page_zone(page, nid * MAX_NR_ZONES + j);
788             set_page_count(page, 0);
789             SetPageReserved(page);
790             INIT_LIST_HEAD(&page->list);
791             if (j != ZONE_HIGHMEM)
792                 set_page_address(page, __va(zone_start_paddr));
793             zone_start_paddr += PAGE_SIZE;
794         }
795 
785-794 Initially, all pages in the zone are marked as reserved as there is no way to know which ones are in use by the boot memory allocator. When the boot memory allocator is retired by free_all_bootmem(), the unused pages will have their PG_reserved bit cleared

786 Get the page for this offset

787 The zone the page belongs to is encoded with the page flags. See Section 2.4.1

788 Set the count to 0 as no one is using it

789 Set the reserved flag. Later, the boot memory allocator will clear this bit if the page is no longer in use

790 Initialise the list head for the page

791-792 Set the page→virtual field if it is available and the page is in low memory

793 Increment zone_start_paddr by a page size as this variable will be used to record the beginning of the next zone
796         offset += size;
797         for (i = 0; ; i++) {
798             unsigned long bitmap_size;
799 
800             INIT_LIST_HEAD(&zone->free_area[i].free_list);
801             if (i == MAX_ORDER-1) {
802                 zone->free_area[i].map = NULL;
803                 break;
804             }
805 
829             bitmap_size = (size-1) >> (i+4);
830             bitmap_size = LONG_ALIGN(bitmap_size+1);
831             zone->free_area[i].map =
832                 (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
833         }
834     }
835     build_zonelists(pgdat);
836 }
This block initialises the free lists for the zone and allocates the bitmap used by the buddy allocator to record the state of page buddies.

797 This will loop from 0 to MAX_ORDER-1

800 Initialise the linked list for the free_list of the current order i

801-804 If this is the last order, then set the free area map to NULL as this is what marks the end of the free lists

829 Calculate the bitmap_size to be the number of bytes required to hold a bitmap where each bit represents one pair of buddies that are 2^i pages in size

830 Align the size to a long with LONG_ALIGN() as all bitwise operations are on longs

831-832 Allocate the memory for the map

834 This loops back to move to the next zone

835 Build the zone fallback lists for this node with build_zonelists()
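The shift by i+4 packs two facts into one operation: each bit covers a pair of 2^i-page buddies (a shift of i+1) and eight bits fit in a byte (a further shift of 3). The standalone program below (not kernel code) reproduces the calculation for an example zone of 16384 pages, modelling LONG_ALIGN() for a 32-bit long.

#include <stdio.h>

int main(void)
{
        unsigned long size = 16384;     /* example zone size in pages  */
        unsigned long i;

        for (i = 0; i < 9; i++) {       /* orders 0 .. MAX_ORDER-2     */
                unsigned long bitmap_size = (size - 1) >> (i + 4);
                /* LONG_ALIGN(bitmap_size+1) with sizeof(long) == 4    */
                bitmap_size = ((bitmap_size + 1) + 3) & ~3UL;
                printf("order %lu: %lu bytes\n", i, bitmap_size);
        }
        return 0;
}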
B.1.6 Function: build_zonelists() (mm/page_alloc.c)

This builds the list of fallback zones for each zone in the requested node. This is for when an allocation cannot be satisfied and another zone is consulted. When this is finished, allocations from ZONE_HIGHMEM will fall back to ZONE_NORMAL. Allocations from ZONE_NORMAL will fall back to ZONE_DMA which in turn has nothing to fall back on.
589 static inline void build_zonelists(pg_data_t *pgdat)
590 {
591     int i, j, k;
592 
593     for (i = 0; i <= GFP_ZONEMASK; i++) {
594         zonelist_t *zonelist;
595         zone_t *zone;
596 
597         zonelist = pgdat->node_zonelists + i;
598         memset(zonelist, 0, sizeof(*zonelist));
599 
600         j = 0;
601         k = ZONE_NORMAL;
602         if (i & __GFP_HIGHMEM)
603             k = ZONE_HIGHMEM;
604         if (i & __GFP_DMA)
605             k = ZONE_DMA;
606 
607         switch (k) {
608             default:
609                 BUG();
610             /*
611              * fallthrough:
612              */
613             case ZONE_HIGHMEM:
614                 zone = pgdat->node_zones + ZONE_HIGHMEM;
615                 if (zone->size) {
616 #ifndef CONFIG_HIGHMEM
617                     BUG();
618 #endif
619                     zonelist->zones[j++] = zone;
620                 }
621             case ZONE_NORMAL:
622                 zone = pgdat->node_zones + ZONE_NORMAL;
623                 if (zone->size)
624                     zonelist->zones[j++] = zone;
625             case ZONE_DMA:
626                 zone = pgdat->node_zones + ZONE_DMA;
627                 if (zone->size)
628                     zonelist->zones[j++] = zone;
629         }
630         zonelist->zones[j++] = NULL;
631     }
632 }
593 This loops through the maximum possible number of zones

597 Get the zonelist for this zone and zero it

600 Start j at 0 which corresponds to ZONE_DMA

601-605 Set k to be the type of zone currently being examined

614 Get the ZONE_HIGHMEM zone

615-620 If the zone has memory, then ZONE_HIGHMEM is the preferred zone to allocate from for high memory allocations. If ZONE_HIGHMEM has no memory, then ZONE_NORMAL will become the preferred zone when the next case is fallen through to as j is not incremented for an empty zone

621-624 Set the next preferred zone to allocate from to be ZONE_NORMAL. Again, do not use it if the zone has no memory

626-628 Set the final fallback zone to be ZONE_DMA. The check is still made for ZONE_DMA having memory as in a NUMA architecture, not all nodes will have a ZONE_DMA
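The end result is easiest to see as the fallback chains themselves. The standalone sketch below (not kernel code; it only prints names) shows the orderings build_zonelists() produces for a node in which every zone has memory.

#include <stdio.h>

int main(void)
{
        const char *highmem[] = { "ZONE_HIGHMEM", "ZONE_NORMAL", "ZONE_DMA", NULL };
        const char *normal[]  = { "ZONE_NORMAL", "ZONE_DMA", NULL };
        const char *dma[]     = { "ZONE_DMA", NULL };
        const char **lists[]  = { highmem, normal, dma };
        const char *masks[]   = { "__GFP_HIGHMEM", "no zone modifier", "__GFP_DMA" };
        int i, j;

        for (i = 0; i < 3; i++) {
                printf("%-16s:", masks[i]);
                for (j = 0; lists[i][j] != NULL; j++)
                        printf(" %s ->", lists[i][j]);
                printf(" NULL\n");
        }
        return 0;
}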
B.2 Page Operations

Contents

B.2 Page Operations . . . . . . . . . . . . . . . . . . . 216
  B.2.1 Locking Pages . . . . . . . . . . . . . . . . . . 216
    B.2.1.1 Function: lock_page() . . . . . . . . . . . . 216
    B.2.1.2 Function: __lock_page() . . . . . . . . . . . 216
    B.2.1.3 Function: sync_page() . . . . . . . . . . . . 217
  B.2.2 Unlocking Pages . . . . . . . . . . . . . . . . . 218
    B.2.2.1 Function: unlock_page() . . . . . . . . . . . 218
  B.2.3 Waiting on Pages . . . . . . . . . . . . . . . . . 219
    B.2.3.1 Function: wait_on_page() . . . . . . . . . . . 219
    B.2.3.2 Function: ___wait_on_page() . . . . . . . . . 219
B.2.1 Locking Pages

B.2.1.1 Function: lock_page() (mm/filemap.c)

This function tries to lock a page. If the page cannot be locked, it will cause the process to sleep until the page is available.

921 void lock_page(struct page *page)
922 {
923     if (TryLockPage(page))
924         __lock_page(page);
925 }

923 TryLockPage() is just a wrapper around test_and_set_bit() for the PG_locked bit in page→flags. If the bit was previously clear, the function returns immediately as the page is now locked

924 Otherwise call __lock_page() (See Section B.2.1.2) to put the process to sleep
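The locking trick is simply an atomic test-and-set of PG_locked. The standalone C11 model below (not kernel code; PG_LOCKED_BIT is a stand-in name) shows how the previous value of the bit decides whether the caller owns the lock or must sleep.

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
        atomic_ulong flags = 0;                 /* stand-in for page->flags */
        const unsigned long PG_LOCKED_BIT = 1UL << 0;

        /* Equivalent of TryLockPage(): set the bit, inspect the old value. */
        unsigned long old = atomic_fetch_or(&flags, PG_LOCKED_BIT);
        if (!(old & PG_LOCKED_BIT))
                printf("bit was clear: lock acquired, return immediately\n");
        else
                printf("bit was set: contended, would sleep in __lock_page()\n");
        return 0;
}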
B.2.1.2 Function: __lock_page() (mm/filemap.c)

This is called after a TryLockPage() failed. It will locate the waitqueue for this page and sleep on it until the lock can be acquired.

897 static void __lock_page(struct page *page)
898 {
899     wait_queue_head_t *waitqueue = page_waitqueue(page);
900     struct task_struct *tsk = current;
901     DECLARE_WAITQUEUE(wait, tsk);
902 
903     add_wait_queue_exclusive(waitqueue, &wait);
904     for (;;) {
905         set_task_state(tsk, TASK_UNINTERRUPTIBLE);
906         if (PageLocked(page)) {
907             sync_page(page);
908             schedule();
909         }
910         if (!TryLockPage(page))
911             break;
912     }
913     __set_task_state(tsk, TASK_RUNNING);
914     remove_wait_queue(waitqueue, &wait);
915 }
899 page_waitqueue() is the implementation of the hash algorithm which determines which wait queue this page belongs to in the table zone→wait_table

900-901 Initialise the waitqueue for this task

903 Add this process to the waitqueue returned by page_waitqueue()

904-912 Loop here until the lock is acquired

905 Set the process state as being in uninterruptible sleep. When schedule() is called, the process will be put to sleep and will not wake again until the queue is explicitly woken up

906 If the page is still locked then call the sync_page() function to schedule the page to be synchronised with its backing storage. Call schedule() to sleep until the queue is woken up, such as when IO on the page completes

910-911 Try and lock the page again. If we succeed, exit the loop, otherwise sleep on the queue again

913-914 The lock is now acquired so set the process state to TASK_RUNNING and remove it from the wait queue. The function now returns with the lock acquired
B.2.1.3 Function: sync_page() (mm/filemap.c)

This calls the filesystem-specific sync_page() to synchronise the page with its backing storage.

140 static inline int sync_page(struct page *page)
141 {
142     struct address_space *mapping = page->mapping;
143 
144     if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
145         return mapping->a_ops->sync_page(page);
146     return 0;
147 }

142 Get the address_space for the page if it exists

144-145 If a backing exists, and it has an associated address_space_operations which provides a sync_page() function, then call it
B.2.2 Unlocking Pages
B.2.2.1 Function: unlock_page() (mm/filemap.c)

This function unlocks a page and wakes up any processes that may be waiting on it.

874 void unlock_page(struct page *page)
875 {
876     wait_queue_head_t *waitqueue = page_waitqueue(page);
877     ClearPageLaunder(page);
878     smp_mb__before_clear_bit();
879     if (!test_and_clear_bit(PG_locked, &(page)->flags))
880         BUG();
881     smp_mb__after_clear_bit();
882 
883     /*
884      * Although the default semantics of wake_up() are
885      * to wake all, here the specific function is used
886      * to make it even more explicit that a number of
887      * pages are being waited on here.
888      */
889     if (waitqueue_active(waitqueue))
890         wake_up_all(waitqueue);
891 }
876 page_waitqueue() is the implementation of the hash algorithm which determines which wait queue this page belongs to in the table zone→wait_table

877 Clear the launder bit as IO has now completed on the page

878 This is a memory barrier which must be issued before performing bit operations that may be seen by multiple processors

879-880 Clear the PG_locked bit. It is a BUG() if the bit was already cleared

881 Complete the SMP memory barrier operation

889-890 If there are processes waiting on the page queue for this page, wake them
B.2.3 Waiting on Pages

B.2.3.1 Function: wait_on_page() (include/linux/pagemap.h)

94 static inline void wait_on_page(struct page * page)
95 {
96     if (PageLocked(page))
97         ___wait_on_page(page);
98 }
96-97 If the page is currently locked, then call ___wait_on_page() to sleep until
it is unlocked
B.2.3.2 Function: ___wait_on_page() (mm/filemap.c)

This function is called after PageLocked() has been used to determine that the page is locked. The calling process will probably sleep until the page is unlocked.

849 void ___wait_on_page(struct page *page)
850 {
851     wait_queue_head_t *waitqueue = page_waitqueue(page);
852     struct task_struct *tsk = current;
853     DECLARE_WAITQUEUE(wait, tsk);
854 
855     add_wait_queue(waitqueue, &wait);
856     do {
857         set_task_state(tsk, TASK_UNINTERRUPTIBLE);
858         if (!PageLocked(page))
859             break;
860         sync_page(page);
861         schedule();
862     } while (PageLocked(page));
863     __set_task_state(tsk, TASK_RUNNING);
864     remove_wait_queue(waitqueue, &wait);
865 }
851 page_waitqueue() is the implementation of the hash algorithm which determines which wait queue this page belongs to in the table zone→wait_table

852-853 Initialise the waitqueue for the current task

855 Add this task to the waitqueue returned by page_waitqueue()

857 Set the process state to be in uninterruptible sleep. When schedule() is called, the process will sleep

858-859 Check to make sure the page was not unlocked since we last checked

860 Call sync_page() (See Section B.2.1.3) to call the filesystem-specific function to synchronise the page with its backing storage

861 Call schedule() to go to sleep. The process will be woken when the page is unlocked

862 Check if the page is still locked. Remember that multiple pages could be using this wait queue and there could be processes sleeping that wish to lock this page

863-864 The page has been unlocked. Set the process to be in the TASK_RUNNING state and remove the process from the waitqueue
Appendix C

Page Table Management

Contents

C.1 Page Table Initialisation . . . . . . . . . . . . . . . . . . . . 222
  C.1.1 Function: paging_init() . . . . . . . . . . . . . . . . . . . 222
  C.1.2 Function: pagetable_init() . . . . . . . . . . . . . . . . . 223
  C.1.3 Function: fixrange_init() . . . . . . . . . . . . . . . . . . 227
  C.1.4 Function: kmap_init() . . . . . . . . . . . . . . . . . . . . 228
C.2 Page Table Walking . . . . . . . . . . . . . . . . . . . . . . . . 230
  C.2.1 Function: follow_page() . . . . . . . . . . . . . . . . . . . 230

C.1 Page Table Initialisation

Contents

C.1 Page Table Initialisation . . . . . . . . . . . . . . 222
  C.1.1 Function: paging_init() . . . . . . . . . . . . . 222
  C.1.2 Function: pagetable_init() . . . . . . . . . . . 223
  C.1.3 Function: fixrange_init() . . . . . . . . . . . . 227
  C.1.4 Function: kmap_init() . . . . . . . . . . . . . . 228

C.1.1 Function: paging_init() (arch/i386/mm/init.c)

This is the top-level function called from setup_arch(). When this function returns, the page tables have been fully set up. Be aware that this is all x86 specific.
351 void __init paging_init(void)
352 {
353     pagetable_init();
354 
355     load_cr3(swapper_pg_dir);
356 
357 #if CONFIG_X86_PAE
362     if (cpu_has_pae)
363         set_in_cr4(X86_CR4_PAE);
364 #endif
365 
366     __flush_tlb_all();
367 
368 #ifdef CONFIG_HIGHMEM
369     kmap_init();
370 #endif
371     zone_sizes_init();
372 }
353 pagetable_init() is responsible for setting up a static page table using swapper_pg_dir as the PGD

355 Load the initialised swapper_pg_dir into the CR3 register so that the CPU will be able to use it

362-363 If PAE is enabled, set the appropriate bit in the CR4 register

366 Flush all TLBs, including the global kernel ones

369 kmap_init() initialises the region of pagetables reserved for use with kmap()

371 zone_sizes_init() (See Section B.1.2) records the size of each of the zones before calling free_area_init() (See Section B.1.3) to initialise each zone
C.1.2 Function: pagetable_init() (arch/i386/mm/init.c)

This function is responsible for statically initialising a pagetable starting with a statically defined PGD called swapper_pg_dir. At the very least, a PTE will be available that points to every page frame in ZONE_NORMAL.

205 static void __init pagetable_init (void)
206 {
207     unsigned long vaddr, end;
208     pgd_t *pgd, *pgd_base;
209     int i, j, k;
210     pmd_t *pmd;
211     pte_t *pte, *pte_base;
212 
213     /*
214      * This can be zero as well - no problem, in that case we exit
215      * the loops anyway due to the PTRS_PER_* conditions.
216      */
217     end = (unsigned long)__va(max_low_pfn*PAGE_SIZE);
218 
219     pgd_base = swapper_pg_dir;
220 #if CONFIG_X86_PAE
221     for (i = 0; i < PTRS_PER_PGD; i++)
222         set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page)));
223 #endif
224     i = __pgd_offset(PAGE_OFFSET);
225     pgd = pgd_base + i;
This first block initialises the PGD. It does this by pointing each entry to the global zero page. Entries needed to reference available memory in ZONE_NORMAL will be allocated later.

217 The variable end marks the end of physical memory in ZONE_NORMAL

219 pgd_base is set to the beginning of the statically declared PGD

220-223 If PAE is enabled, it is insufficient to leave each entry as simply 0 (which in effect points each entry to the global zero page) as each pgd_t is a struct. Instead, set_pgd must be called for each pgd_t to point the entry to the global zero page

224 i is initialised as the offset within the PGD that corresponds to PAGE_OFFSET. In other words, this function will only be initialising the kernel portion of the linear address space, the userspace portion is left alone

225 pgd is initialised to the pgd_t corresponding to the beginning of the kernel portion of the linear address space
227     for (; i < PTRS_PER_PGD; pgd++, i++) {
228         vaddr = i*PGDIR_SIZE;
229         if (end && (vaddr >= end))
230             break;
231 #if CONFIG_X86_PAE
232         pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
233         set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
234 #else
235         pmd = (pmd_t *)pgd;
236 #endif
237         if (pmd != pmd_offset(pgd, 0))
238             BUG();
This loop begins setting up valid PMD entries to point to. In the PAE case, pages are allocated with alloc_bootmem_low_pages() and the PGD is set appropriately. Without PAE, there is no middle directory, so it is just folded back onto the PGD to preserve the illusion of a 3-level pagetable.

227 i is already initialised to the beginning of the kernel portion of the linear address space so keep looping until the last pgd_t at PTRS_PER_PGD is reached

228 Calculate the virtual address for this PGD

229-230 If the end of ZONE_NORMAL is reached, exit the loop as further page table entries are not needed

231-234 If PAE is enabled, allocate a page for the PMD and insert it with set_pgd()

235 If PAE is not available, just set pmd to the current pgd_t. This is the folding back trick for emulating 3-level pagetables

237-238 Sanity check to make sure the PMD is valid
239         for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
240             vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
241             if (end && (vaddr >= end))
242                 break;
243             if (cpu_has_pse) {
244                 unsigned long __pe;
245 
246                 set_in_cr4(X86_CR4_PSE);
247                 boot_cpu_data.wp_works_ok = 1;
248                 __pe = _KERNPG_TABLE + _PAGE_PSE + __pa(vaddr);
249                 /* Make it "global" too if supported */
250                 if (cpu_has_pge) {
251                     set_in_cr4(X86_CR4_PGE);
252                     __pe += _PAGE_GLOBAL;
253                 }
254                 set_pmd(pmd, __pmd(__pe));
255                 continue;
256             }
257 
258             pte_base = pte =
259                 (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
Initialise each entry in the PMD. This loop will only execute once unless PAE is enabled. Remember that without PAE, PTRS_PER_PMD is 1.

240 Calculate the virtual address for this PMD

241-242 If the end of ZONE_NORMAL is reached, finish

243-248 If the CPU supports PSE, then use large TLB entries. This means that for kernel pages, a TLB entry will map 4MiB instead of the normal 4KiB and the third level of PTEs is unnecessary

248 __pe is set as the flags for a kernel pagetable (_KERNPG_TABLE), the flag to indicate that this is an entry mapping 4MiB (_PAGE_PSE) and then set to the physical address for this virtual address with __pa(). Note that this means that this 4MiB of physical memory is not being mapped by a third level of PTEs

250-253 If the CPU supports PGE, then set it for this page table entry. This marks the entry as being global and visible to all processes

254-255 As the third level is not required because of PSE, set the PMD now with set_pmd() and continue to the next PMD

258 Else, PSE is not supported and PTEs are required so allocate a page for them
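A quick back-of-the-envelope figure shows why PSE is attractive here. The standalone calculation below (not kernel code; the 896MiB figure is the usual x86 ZONE_NORMAL size and is assumed for the example) counts how many 4MiB PMD entries cover ZONE_NORMAL and how many PTE pages they replace.

#include <stdio.h>

int main(void)
{
        unsigned long zone_normal_mib = 896;              /* typical x86 ZONE_NORMAL */
        unsigned long pse_entries = zone_normal_mib / 4;  /* one 4MiB mapping each   */

        /* Without PSE, each of those 4MiB regions would instead need a 4KiB
         * page holding 1024 PTEs, i.e. one PTE page per PMD entry.           */
        printf("%lu PSE entries replace %lu pages of PTEs (%lu KiB)\n",
               pse_entries, pse_entries, pse_entries * 4);
        return 0;
}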
260             for (k = 0; k < PTRS_PER_PTE; pte++, k++) {
261                 vaddr = i*PGDIR_SIZE + j*PMD_SIZE + k*PAGE_SIZE;
262                 if (end && (vaddr >= end))
263                     break;
264                 *pte = mk_pte_phys(__pa(vaddr), PAGE_KERNEL);
265             }
266             set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte_base)));
267             if (pte_base != pte_offset(pmd, 0))
268                 BUG();
269         }
270     }
271 
Initialise the PTEs.
260-265 For each pte_t, calculate the virtual address currently being examined and create a PTE that points to the appropriate physical page frame

266 The PTEs have been initialised so set the PMD to point to it

267-268 Make sure that the entry was established correctly
273     /*
274      * Fixed mappings, only the page table structure has to be
275      * created - mappings will be set by set_fixmap():
276      */
277     vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
278     fixrange_init(vaddr, 0, pgd_base);
279 
280 #if CONFIG_HIGHMEM
281     /*
282      * Permanent kmaps:
283      */
284     vaddr = PKMAP_BASE;
285     fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
286 
287     pgd = swapper_pg_dir + __pgd_offset(vaddr);
288     pmd = pmd_offset(pgd, vaddr);
289     pte = pte_offset(pmd, vaddr);
290     pkmap_page_table = pte;
291 #endif
292 
293 #if CONFIG_X86_PAE
294     /*
295      * Add low memory identity-mappings - SMP needs it when
296      * starting up on an AP from real-mode. In the non-PAE
297      * case we already have these mappings through head.S.
298      * All user-space mappings are explicitly cleared after
299      * SMP startup.
300      */
301     pgd_base[0] = pgd_base[USER_PTRS_PER_PGD];
302 #endif
303 }
At this point, page table entries have been set up which reference all parts of ZONE_NORMAL. The remaining regions needed are those for fixed mappings and those needed for mapping high memory pages with kmap().

277 The fixed address space is considered to start at FIXADDR_TOP and finish earlier in the address space. __fix_to_virt() takes an index as a parameter and returns the index'th page frame backwards (starting from FIXADDR_TOP) within the fixed virtual address space. __end_of_fixed_addresses is the last index used by the fixed virtual address space. In other words, this line returns the virtual address of the PMD that corresponds to the beginning of the fixed virtual address space

278 By passing 0 as the end to fixrange_init(), the function will start at vaddr and build valid PGDs and PMDs until the end of the virtual address space. PTEs are not needed for these addresses

280-291 Set up page tables for use with kmap()

287-290 Get the PTE corresponding to the beginning of the region for use with kmap()

301 This sets up a temporary identity mapping between the virtual address 0 and the physical address 0
C.1.3 Function: fixrange_init() (arch/i386/mm/init.c)

This function creates valid PGDs and PMDs for fixed virtual address mappings.

167 static void __init fixrange_init (unsigned long start,
                unsigned long end, pgd_t *pgd_base)
168 {
169     pgd_t *pgd;
170     pmd_t *pmd;
171     pte_t *pte;
172     int i, j;
173     unsigned long vaddr;
174 
175     vaddr = start;
176     i = __pgd_offset(vaddr);
177     j = __pmd_offset(vaddr);
178     pgd = pgd_base + i;
179 
180     for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
181 #if CONFIG_X86_PAE
182         if (pgd_none(*pgd)) {
183             pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
184             set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
185             if (pmd != pmd_offset(pgd, 0))
186                 printk("PAE BUG #02!\n");
187         }
188         pmd = pmd_offset(pgd, vaddr);
189 #else
190         pmd = (pmd_t *)pgd;
191 #endif
192         for (; (j < PTRS_PER_PMD) && (vaddr != end); pmd++, j++) {
193             if (pmd_none(*pmd)) {
194                 pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
195                 set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
196                 if (pte != pte_offset(pmd, 0))
197                     BUG();
198             }
199             vaddr += PMD_SIZE;
200         }
201         j = 0;
202     }
203 }
175 Set the starting virtual address (vaddr) to the requested starting address provided as the parameter

176 Get the index within the PGD corresponding to vaddr

177 Get the index within the PMD corresponding to vaddr

178 Get the starting pgd_t

180 Keep cycling until end is reached. When pagetable_init() passes in 0, this loop will continue until the end of the PGD

182-187 In the case of PAE, allocate a page for the PMD if one has not already been allocated

190 Without PAE, there is no PMD so treat the pgd_t as the pmd_t

192-200 For each entry in the PMD, allocate a page for the pte_t entries and set it within the pagetables. Note that vaddr is incremented in PMD-sized strides
C.1.4 Function: kmap_init() (arch/i386/mm/init.c)

This function only exists if CONFIG_HIGHMEM is set during compile time. It is responsible for caching where the beginning of the kmap region is, the PTE referencing it and the protection for the page tables. This means the PGD will not have to be checked every time kmap() is used.

74 #if CONFIG_HIGHMEM
75 pte_t *kmap_pte;
76 pgprot_t kmap_prot;
77 
78 #define kmap_get_fixmap_pte(vaddr) \
79     pte_offset(pmd_offset(pgd_offset_k(vaddr), (vaddr)), (vaddr))
80 
81 void __init kmap_init(void)
82 {
83     unsigned long kmap_vstart;
84 
85     /* cache the first kmap pte */
86     kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
87     kmap_pte = kmap_get_fixmap_pte(kmap_vstart);
88 
89     kmap_prot = PAGE_KERNEL;
90 }
91 #endif /* CONFIG_HIGHMEM */

78-79 As fixrange_init() has already set up valid PGDs and PMDs, there is no need to double check them so kmap_get_fixmap_pte() is responsible for quickly traversing the page table

86 Cache the virtual address for the kmap region in kmap_vstart

87 Cache the PTE for the start of the kmap region in kmap_pte

89 Cache the protection for the page table entries with kmap_prot
C.2 Page Table Walking

Contents

C.2 Page Table Walking . . . . . . . . . . . . . . . . . 230
  C.2.1 Function: follow_page() . . . . . . . . . . . . . 230

C.2.1 Function: follow_page() (mm/memory.c)

This function returns the struct page used by the PTE at address in mm's page tables.
405 static struct page * follow_page(struct mm_struct *mm,
                        unsigned long address, int write)
406 {
407     pgd_t *pgd;
408     pmd_t *pmd;
409     pte_t *ptep, pte;
410 
411     pgd = pgd_offset(mm, address);
412     if (pgd_none(*pgd) || pgd_bad(*pgd))
413         goto out;
414 
415     pmd = pmd_offset(pgd, address);
416     if (pmd_none(*pmd) || pmd_bad(*pmd))
417         goto out;
418 
419     ptep = pte_offset(pmd, address);
420     if (!ptep)
421         goto out;
422 
423     pte = *ptep;
424     if (pte_present(pte)) {
425         if (!write ||
426             (pte_write(pte) && pte_dirty(pte)))
427             return pte_page(pte);
428     }
429 
430 out:
431     return 0;
432 }
405 The parameters are the mm whose page tables are about to be walked, the address whose struct page is of interest and write which indicates if the page is about to be written to

411 Get the PGD for the address and make sure it is present and valid

415-417 Get the PMD for the address and make sure it is present and valid

419 Get the PTE for the address and make sure it exists

424 If the PTE is currently present, then we have something to return

425-426 If the caller has indicated a write is about to take place, check to make sure that the PTE has write permissions set and that the PTE is already marked dirty

427 If the PTE is present and the permissions are fine, return the struct page mapped by the PTE

431 Return 0 indicating that the address has no associated struct page
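follow_page() is internal to the kernel, but it may help to see the same virtual-to-physical question asked from userspace. The sketch below is not part of the 2.4 code base covered by this book: it reads /proc/self/pagemap, an interface that only exists on much later kernels, and on recent kernels the PFN field reads as zero without privilege. It is shown only as an illustration of an address-to-frame lookup.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
        unsigned long vaddr = (unsigned long)main;
        long pagesize = sysconf(_SC_PAGESIZE);
        FILE *f = fopen("/proc/self/pagemap", "rb");
        uint64_t entry;

        if (f == NULL)
                return 1;
        /* One 64-bit entry per virtual page: bit 63 = present, bits 0-54 = PFN. */
        fseek(f, (long)(vaddr / pagesize) * (long)sizeof(entry), SEEK_SET);
        if (fread(&entry, sizeof(entry), 1, f) == 1 && (entry >> 63))
                printf("vaddr %#lx -> pfn %#llx\n", vaddr,
                       (unsigned long long)(entry & ((1ULL << 55) - 1)));
        fclose(f);
        return 0;
}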
Appendix D

Process Address Space

Contents

D.1 Process Memory Descriptors . . . . . . . . . . . . . . . . . . . . 236
  D.1.1 Initialising a Descriptor . . . . . . . . . . . . . . . . . . . 236
  D.1.2 Copying a Descriptor . . . . . . . . . . . . . . . . . . . . . 236
    D.1.2.1 Function: copy_mm() . . . . . . . . . . . . . . . . . . . . 236
    D.1.2.2 Function: mm_init() . . . . . . . . . . . . . . . . . . . . 239
  D.1.3 Allocating a Descriptor . . . . . . . . . . . . . . . . . . . . 239
    D.1.3.1 Function: allocate_mm() . . . . . . . . . . . . . . . . . . 239
    D.1.3.2 Function: mm_alloc() . . . . . . . . . . . . . . . . . . . 240
  D.1.4 Destroying a Descriptor . . . . . . . . . . . . . . . . . . . . 240
    D.1.4.1 Function: mmput() . . . . . . . . . . . . . . . . . . . . . 240
    D.1.4.2 Function: mmdrop() . . . . . . . . . . . . . . . . . . . . 241
    D.1.4.3 Function: __mmdrop() . . . . . . . . . . . . . . . . . . . 242
D.2 Creating Memory Regions . . . . . . . . . . . . . . . . . . . . . . 243
  D.2.1 Creating A Memory Region . . . . . . . . . . . . . . . . . . . 243
    D.2.1.1 Function: do_mmap() . . . . . . . . . . . . . . . . . . . . 243
    D.2.1.2 Function: do_mmap_pgoff() . . . . . . . . . . . . . . . . . 244
  D.2.2 Inserting a Memory Region . . . . . . . . . . . . . . . . . . . 252
    D.2.2.1 Function: __insert_vm_struct() . . . . . . . . . . . . . . 252
    D.2.2.2 Function: find_vma_prepare() . . . . . . . . . . . . . . . 253
    D.2.2.3 Function: vma_link() . . . . . . . . . . . . . . . . . . . 255
    D.2.2.4 Function: __vma_link() . . . . . . . . . . . . . . . . . . 256
    D.2.2.5 Function: __vma_link_list() . . . . . . . . . . . . . . . . 256
    D.2.2.6 Function: __vma_link_rb() . . . . . . . . . . . . . . . . . 257
    D.2.2.7 Function: __vma_link_file() . . . . . . . . . . . . . . . . 257
  D.2.3 Merging Contiguous Regions . . . . . . . . . . . . . . . . . . 258
    D.2.3.1 Function: vma_merge() . . . . . . . . . . . . . . . . . . . 258
    D.2.3.2 Function: can_vma_merge() . . . . . . . . . . . . . . . . . 260
  D.2.4 Remapping and Moving a Memory Region . . . . . . . . . . . . . 261
    D.2.4.1 Function: sys_mremap() . . . . . . . . . . . . . . . . . . 261
    D.2.4.2 Function: do_mremap() . . . . . . . . . . . . . . . . . . . 261
    D.2.4.3 Function: move_vma() . . . . . . . . . . . . . . . . . . . 267
    D.2.4.4 Function: make_pages_present() . . . . . . . . . . . . . . 271
    D.2.4.5 Function: get_user_pages() . . . . . . . . . . . . . . . . 272
    D.2.4.6 Function: move_page_tables() . . . . . . . . . . . . . . . 276
    D.2.4.7 Function: move_one_page() . . . . . . . . . . . . . . . . . 277
    D.2.4.8 Function: get_one_pte() . . . . . . . . . . . . . . . . . . 277
    D.2.4.9 Function: alloc_one_pte() . . . . . . . . . . . . . . . . . 278
    D.2.4.10 Function: copy_one_pte() . . . . . . . . . . . . . . . . . 279
  D.2.5 Deleting a memory region . . . . . . . . . . . . . . . . . . . 280
    D.2.5.1 Function: do_munmap() . . . . . . . . . . . . . . . . . . . 280
    D.2.5.2 Function: unmap_fixup() . . . . . . . . . . . . . . . . . . 284
  D.2.6 Deleting all memory regions . . . . . . . . . . . . . . . . . . 287
    D.2.6.1 Function: exit_mmap() . . . . . . . . . . . . . . . . . . . 287
    D.2.6.2 Function: clear_page_tables() . . . . . . . . . . . . . . . 290
    D.2.6.3 Function: free_one_pgd() . . . . . . . . . . . . . . . . . 290
    D.2.6.4 Function: free_one_pmd() . . . . . . . . . . . . . . . . . 291
D.3 Searching Memory Regions . . . . . . . . . . . . . . . . . . . . . 293
  D.3.1 Finding a Mapped Memory Region . . . . . . . . . . . . . . . . 293
    D.3.1.1 Function: find_vma() . . . . . . . . . . . . . . . . . . . 293
    D.3.1.2 Function: find_vma_prev() . . . . . . . . . . . . . . . . . 294
    D.3.1.3 Function: find_vma_intersection() . . . . . . . . . . . . . 296
  D.3.2 Finding a Free Memory Region . . . . . . . . . . . . . . . . . 296
    D.3.2.1 Function: get_unmapped_area() . . . . . . . . . . . . . . . 296
    D.3.2.2 Function: arch_get_unmapped_area() . . . . . . . . . . . . 297
D.4 Locking and Unlocking Memory Regions . . . . . . . . . . . . . . . 299
  D.4.1 Locking a Memory Region . . . . . . . . . . . . . . . . . . . . 299
    D.4.1.1 Function: sys_mlock() . . . . . . . . . . . . . . . . . . . 299
    D.4.1.2 Function: sys_mlockall() . . . . . . . . . . . . . . . . . 300
    D.4.1.3 Function: do_mlockall() . . . . . . . . . . . . . . . . . . 302
    D.4.1.4 Function: do_mlock() . . . . . . . . . . . . . . . . . . . 303
  D.4.2 Unlocking the region . . . . . . . . . . . . . . . . . . . . . 305
    D.4.2.1 Function: sys_munlock() . . . . . . . . . . . . . . . . . . 305
    D.4.2.2 Function: sys_munlockall() . . . . . . . . . . . . . . . . 306
  D.4.3 Fixing up regions after locking/unlocking . . . . . . . . . . . 306
    D.4.3.1 Function: mlock_fixup() . . . . . . . . . . . . . . . . . . 306
    D.4.3.2 Function: mlock_fixup_all() . . . . . . . . . . . . . . . . 308
    D.4.3.3 Function: mlock_fixup_start() . . . . . . . . . . . . . . . 308
    D.4.3.4 Function: mlock_fixup_end() . . . . . . . . . . . . . . . . 309
    D.4.3.5 Function: mlock_fixup_middle() . . . . . . . . . . . . . . 310
D.5 Page Faulting . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
  D.5.1 x86 Page Fault Handler . . . . . . . . . . . . . . . . . . . . 313
    D.5.1.1 Function: do_page_fault() . . . . . . . . . . . . . . . . . 313
  D.5.2 Expanding the Stack . . . . . . . . . . . . . . . . . . . . . . 323
    D.5.2.1 Function: expand_stack() . . . . . . . . . . . . . . . . . 323
  D.5.3 Architecture Independent Page Fault Handler . . . . . . . . . . 324
    D.5.3.1 Function: handle_mm_fault() . . . . . . . . . . . . . . . . 324
    D.5.3.2 Function: handle_pte_fault() . . . . . . . . . . . . . . . 326
  D.5.4 Demand Allocation . . . . . . . . . . . . . . . . . . . . . . . 327
    D.5.4.1 Function: do_no_page() . . . . . . . . . . . . . . . . . . 327
    D.5.4.2 Function: do_anonymous_page() . . . . . . . . . . . . . . . 330
  D.5.5 Demand Paging . . . . . . . . . . . . . . . . . . . . . . . . . 332
    D.5.5.1 Function: do_swap_page() . . . . . . . . . . . . . . . . . 332
    D.5.5.2 Function: can_share_swap_page() . . . . . . . . . . . . . . 336
    D.5.5.3 Function: exclusive_swap_page() . . . . . . . . . . . . . . 337
  D.5.6 Copy On Write (COW) Pages . . . . . . . . . . . . . . . . . . . 338
    D.5.6.1 Function: do_wp_page() . . . . . . . . . . . . . . . . . . 338
D.6 Page-Related Disk IO . . . . . . . . . . . . . . . . . . . . . . . 341
  D.6.1 Generic File Reading . . . . . . . . . . . . . . . . . . . . . 341
    D.6.1.1 Function: generic_file_read() . . . . . . . . . . . . . . . 341
    D.6.1.2 Function: do_generic_file_read() . . . . . . . . . . . . . 344
    D.6.1.3 Function: generic_file_readahead() . . . . . . . . . . . . 351
  D.6.2 Generic File mmap() . . . . . . . . . . . . . . . . . . . . . . 355
    D.6.2.1 Function: generic_file_mmap() . . . . . . . . . . . . . . . 355
  D.6.3 Generic File Truncation . . . . . . . . . . . . . . . . . . . . 356
    D.6.3.1 Function: vmtruncate() . . . . . . . . . . . . . . . . . . 356
    D.6.3.2 Function: vmtruncate_list() . . . . . . . . . . . . . . . . 358
    D.6.3.3 Function: zap_page_range() . . . . . . . . . . . . . . . . 359
    D.6.3.4 Function: zap_pmd_range() . . . . . . . . . . . . . . . . . 361
    D.6.3.5 Function: zap_pte_range() . . . . . . . . . . . . . . . . . 362
    D.6.3.6 Function: truncate_inode_pages() . . . . . . . . . . . . . 364
    D.6.3.7 Function: truncate_list_pages() . . . . . . . . . . . . . . 365
    D.6.3.8 Function: truncate_complete_page() . . . . . . . . . . . . 367
    D.6.3.9 Function: do_flushpage() . . . . . . . . . . . . . . . . . 368
    D.6.3.10 Function: truncate_partial_page() . . . . . . . . . . . . 368
  D.6.4 Reading Pages for the Page Cache . . . . . . . . . . . . . . . 369
    D.6.4.1 Function: filemap_nopage() . . . . . . . . . . . . . . . . 369
    D.6.4.2 Function: page_cache_read() . . . . . . . . . . . . . . . . 374
  D.6.5 File Readahead for nopage() . . . . . . . . . . . . . . . . . . 375
    D.6.5.1 Function: nopage_sequential_readahead() . . . . . . . . . . 375
    D.6.5.2 Function: read_cluster_nonblocking() . . . . . . . . . . . 377
  D.6.6 Swap Related Read-Ahead . . . . . . . . . . . . . . . . . . . . 378
    D.6.6.1 Function: swapin_readahead() . . . . . . . . . . . . . . . 378
    D.6.6.2 Function: valid_swaphandles() . . . . . . . . . . . . . . . 379
D.1 Process Memory Descriptors

Contents

D.1 Process Memory Descriptors . . . . . . . . . . . . . 236
  D.1.1 Initialising a Descriptor . . . . . . . . . . . . 236
  D.1.2 Copying a Descriptor . . . . . . . . . . . . . . 236
    D.1.2.1 Function: copy_mm() . . . . . . . . . . . . . 236
    D.1.2.2 Function: mm_init() . . . . . . . . . . . . . 239
  D.1.3 Allocating a Descriptor . . . . . . . . . . . . . 239
    D.1.3.1 Function: allocate_mm() . . . . . . . . . . . 239
    D.1.3.2 Function: mm_alloc() . . . . . . . . . . . . . 240
  D.1.4 Destroying a Descriptor . . . . . . . . . . . . . 240
    D.1.4.1 Function: mmput() . . . . . . . . . . . . . . 240
    D.1.4.2 Function: mmdrop() . . . . . . . . . . . . . . 241
    D.1.4.3 Function: __mmdrop() . . . . . . . . . . . . . 242

This section covers the functions used to allocate, initialise, copy and destroy memory descriptors.

D.1.1 Initialising a Descriptor

The initial mm_struct in the system is called init_mm and is statically initialised at compile time using the macro INIT_MM().
238 #define INIT_MM(name) \
239 {                                                        \
240     mm_rb:           RB_ROOT,                            \
241     pgd:             swapper_pg_dir,                     \
242     mm_users:        ATOMIC_INIT(2),                     \
243     mm_count:        ATOMIC_INIT(1),                     \
244     mmap_sem:        __RWSEM_INITIALIZER(name.mmap_sem), \
245     page_table_lock: SPIN_LOCK_UNLOCKED,                 \
246     mmlist:          LIST_HEAD_INIT(name.mmlist),        \
247 }

Once it is established, new mm_structs are copies of their parent mm_struct. They are copied using copy_mm() with the process specific fields initialised with mm_init().
D.1.2 Copying a Descriptor

D.1.2.1 Function: copy_mm() (kernel/fork.c)

This function makes a copy of the mm_struct for the given task. This is only called from do_fork() after a new process has been created and needs its own mm_struct.
315 static int copy_mm(unsigned long clone_flags,
                       struct task_struct * tsk)
316 {
317     struct mm_struct * mm, *oldmm;
318     int retval;
319 
320     tsk->min_flt = tsk->maj_flt = 0;
321     tsk->cmin_flt = tsk->cmaj_flt = 0;
322     tsk->nswap = tsk->cnswap = 0;
323 
324     tsk->mm = NULL;
325     tsk->active_mm = NULL;
326 
327     /*
328      * Are we cloning a kernel thread?
330      * We need to steal a active VM for that..
331      */
332     oldmm = current->mm;
333     if (!oldmm)
334         return 0;
335 
336     if (clone_flags & CLONE_VM) {
337         atomic_inc(&oldmm->mm_users);
338         mm = oldmm;
339         goto good_mm;
340     }
Reset fields that are not inherited by a child mm_struct and find a mm to copy from.

315 The parameters are the flags passed for clone and the task that is creating a copy of the mm_struct

320-325 Initialise the task_struct fields related to memory management

332 Borrow the mm of the current running process to copy from

333 A kernel thread has no mm so it can return immediately

336-341 If the CLONE_VM flag is set, the child process is to share the mm with the parent process. This is required by users like pthreads. The mm_users field is incremented so the mm is not destroyed prematurely later. The good_mm label sets tsk→mm and tsk→active_mm and returns success

342     retval = -ENOMEM;
343     mm = allocate_mm();
344     if (!mm)
345         goto fail_nomem;
346 
347     /* Copy the current MM stuff.. */
348     memcpy(mm, oldmm, sizeof(*mm));
349 
350     if (!mm_init(mm))
351         goto fail_nomem;
352 
353     if (init_new_context(tsk,mm))
354         goto free_pt;
355 
356     down_write(&oldmm->mmap_sem);
357     retval = dup_mmap(mm);
358     up_write(&oldmm->mmap_sem);
343 Allocate a new mm

348-350 Copy the parent mm and initialise the process specific mm fields with mm_init()

352-353 Initialise the MMU context for architectures that do not automatically manage their MMU

355-357 Call dup_mmap() which is responsible for copying all the VMA regions in use by the parent process
359     if (retval)
360         goto free_pt;
361 
362     /*
363      * child gets a private LDT (if there was an LDT in the parent)
364      */
365     copy_segments(tsk, mm);
366 
367 good_mm:
368     tsk->mm = mm;
369     tsk->active_mm = mm;
370     return 0;
371 
372 free_pt:
373     mmput(mm);
374 fail_nomem:
375     return retval;
376 }
359 dup_mmap() returns 0 on success. If it failed, the label free_pt will call mmput() which decrements the use count of the mm

365 This copies the LDT for the new process based on the parent process

368-370 Set the new mm, active_mm and return success
D.1.2.2
Function: mm_init()
(kernel/fork.c)
This function initialises process specic mm elds.
230 static struct mm_struct * mm_init(struct mm_struct * mm)
231 {
232
atomic_set(&mm->mm_users, 1);
233
atomic_set(&mm->mm_count, 1);
234
init_rwsem(&mm->mmap_sem);
235
mm->page_table_lock = SPIN_LOCK_UNLOCKED;
236
mm->pgd = pgd_alloc(mm);
237
mm->def_flags = 0;
238
if (mm->pgd)
239
return mm;
240
free_mm(mm);
241
return NULL;
242 }
232 Set the number of users to 1
233 Set the reference count of the mm to 1
234 Initialise the semaphore protecting the VMA list
235 Initialise the spinlock protecting write access to it
236 Allocate a new PGD for the struct
237 By default, pages used by the process are not locked in memory
238 If a PGD exists, return the initialised struct
240 Initialisation failed, delete the mm_struct and return
D.1.3 Allocating a Descriptor

Two functions are provided for allocating a mm_struct. To be slightly confusing, they are essentially the same. allocate_mm() will allocate a mm_struct from the slab allocator. mm_alloc() will allocate the struct and then call the function mm_init() to initialise it.

D.1.3.1 Function: allocate_mm() (kernel/fork.c)

227 #define allocate_mm() (kmem_cache_alloc(mm_cachep, SLAB_KERNEL))

227 Allocate a mm_struct from the slab allocator
D.1.3.2 Function: mm_alloc() (kernel/fork.c)

248 struct mm_struct * mm_alloc(void)
249 {
250     struct mm_struct * mm;
251 
252     mm = allocate_mm();
253     if (mm) {
254         memset(mm, 0, sizeof(*mm));
255         return mm_init(mm);
256     }
257     return NULL;
258 }

252 Allocate a mm_struct from the slab allocator

254 Zero out all contents of the struct

255 Perform basic initialisation
D.1.4 Destroying a Descriptor

A new user of an mm increments the usage count with a simple call:

  atomic_inc(&mm->mm_users);

It is decremented with a call to mmput(). If the mm_users count reaches zero, all the mapped regions are deleted with exit_mmap() and the page tables destroyed as there are no longer any users of the userspace portions. The mm_count count is decremented with mmdrop() as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count reaches zero, the mm_struct will be destroyed.
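The interaction of the two counters can be modelled with plain integers. The toy program below (not kernel code) walks through the final mmput() of a process whose mm had one remaining user, showing why exit_mmap() and __mmdrop() fire in that order.

#include <stdio.h>

int main(void)
{
        int mm_users = 1;       /* users of the userspace mappings  */
        int mm_count = 1;       /* users of the mm_struct itself    */

        /* mmput(): drop a user of the mappings. */
        if (--mm_users == 0) {
                printf("exit_mmap(): tear down all mapped regions\n");
                /* mmdrop(): the collective mm_users reference is gone. */
                if (--mm_count == 0)
                        printf("__mmdrop(): free PGD, LDT and the mm_struct\n");
        }
        return 0;
}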
D.1.4.1 Function: mmput() (kernel/fork.c)

276 void mmput(struct mm_struct *mm)
277 {
278     if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)) {
279         extern struct mm_struct *swap_mm;
280         if (swap_mm == mm)
281             swap_mm = list_entry(mm->mmlist.next,
                        struct mm_struct, mmlist);
282         list_del(&mm->mmlist);
283         mmlist_nr--;
284         spin_unlock(&mmlist_lock);
285         exit_mmap(mm);
286         mmdrop(mm);
287     }
288 }

Figure D.1: Call Graph: mmput()
278 Atomically decrement the mm_users field while holding the mmlist_lock lock. Return with the lock held if the count reaches zero
279-286
If the usage count reaches zero, the mm and associated structures need
to be removed
279-281 The swap_mm is the last mm that was swapped out by the vmscan code.
If the current process was the last mm swapped, move to the next entry in the
list
282 Remove this mm from the list
283-284 Reduce the count of mms in the list and release the mmlist lock
285 Remove all associated mappings
286 Delete the mm
D.1.4.2 Function: mmdrop() (include/linux/sched.h)

765 static inline void mmdrop(struct mm_struct * mm)
766 {
767     if (atomic_dec_and_test(&mm->mm_count))
768         __mmdrop(mm);
769 }
767 Atomically decrement the reference count. The reference count could be higher if the mm was being used by lazy TLB switching tasks
768 If the reference count reaches zero, call __mmdrop()
D.1.4.3 Function: __mmdrop() (kernel/fork.c)

265 inline void __mmdrop(struct mm_struct *mm)
266 {
267     BUG_ON(mm == &init_mm);
268     pgd_free(mm->pgd);
269     destroy_context(mm);
270     free_mm(mm);
271 }
267 Make sure the init_mm is not destroyed
268 Delete the PGD entry
269 Delete the LDT
270 Call kmem_cache_free() for the mm freeing it with the slab allocator
D.2 Creating Memory Regions

Contents

D.2 Creating Memory Regions . . . . . . . . . . . . . . . 243
  D.2.1 Creating A Memory Region . . . . . . . . . . . . . 243
    D.2.1.1 Function: do_mmap() . . . . . . . . . . . . . . 243
    D.2.1.2 Function: do_mmap_pgoff() . . . . . . . . . . . 244
  D.2.2 Inserting a Memory Region . . . . . . . . . . . . . 252
    D.2.2.1 Function: __insert_vm_struct() . . . . . . . . 252
    D.2.2.2 Function: find_vma_prepare() . . . . . . . . . 253
    D.2.2.3 Function: vma_link() . . . . . . . . . . . . . 255
    D.2.2.4 Function: __vma_link() . . . . . . . . . . . . 256
    D.2.2.5 Function: __vma_link_list() . . . . . . . . . . 256
    D.2.2.6 Function: __vma_link_rb() . . . . . . . . . . . 257
    D.2.2.7 Function: __vma_link_file() . . . . . . . . . . 257
  D.2.3 Merging Contiguous Regions . . . . . . . . . . . . 258
    D.2.3.1 Function: vma_merge() . . . . . . . . . . . . . 258
    D.2.3.2 Function: can_vma_merge() . . . . . . . . . . . 260
  D.2.4 Remapping and Moving a Memory Region . . . . . . . 261
    D.2.4.1 Function: sys_mremap() . . . . . . . . . . . . 261
    D.2.4.2 Function: do_mremap() . . . . . . . . . . . . . 261
    D.2.4.3 Function: move_vma() . . . . . . . . . . . . . 267
    D.2.4.4 Function: make_pages_present() . . . . . . . . 271
    D.2.4.5 Function: get_user_pages() . . . . . . . . . . 272
    D.2.4.6 Function: move_page_tables() . . . . . . . . . 276
    D.2.4.7 Function: move_one_page() . . . . . . . . . . . 277
    D.2.4.8 Function: get_one_pte() . . . . . . . . . . . . 277
    D.2.4.9 Function: alloc_one_pte() . . . . . . . . . . . 278
    D.2.4.10 Function: copy_one_pte() . . . . . . . . . . . 279
  D.2.5 Deleting a memory region . . . . . . . . . . . . . 280
    D.2.5.1 Function: do_munmap() . . . . . . . . . . . . . 280
    D.2.5.2 Function: unmap_fixup() . . . . . . . . . . . . 284
  D.2.6 Deleting all memory regions . . . . . . . . . . . . 287
    D.2.6.1 Function: exit_mmap() . . . . . . . . . . . . . 287
    D.2.6.2 Function: clear_page_tables() . . . . . . . . . 290
    D.2.6.3 Function: free_one_pgd() . . . . . . . . . . . 290
    D.2.6.4 Function: free_one_pmd() . . . . . . . . . . . 291

This large section deals with the creation, deletion and manipulation of memory regions.

D.2.1 Creating A Memory Region

The main call graph for creating a memory region is shown in Figure 4.4.
D.2.1.1 Function: do_mmap() (include/linux/mm.h)

This is a very simple wrapper function around do_mmap_pgoff() which performs most of the work.
557 static inline unsigned long do_mmap(struct file *file,
                unsigned long addr,
558             unsigned long len, unsigned long prot,
559             unsigned long flag, unsigned long offset)
560 {
561     unsigned long ret = -EINVAL;
562     if ((offset + PAGE_ALIGN(len)) < offset)
563         goto out;
564     if (!(offset & ~PAGE_MASK))
565         ret = do_mmap_pgoff(file, addr, len, prot, flag,
                                offset >> PAGE_SHIFT);
566 out:
567     return ret;
568 }
561 By default, return -EINVAL

562-563 Make sure that the size of the region will not overflow the total size of the address space

564-565 Provided the offset is page aligned, call do_mmap_pgoff() to map the region
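For orientation, the userspace program below (an illustration only, not part of the kernel sources; the file name is arbitrary) issues the mmap() system call that eventually arrives in do_mmap_pgoff() with the length page-aligned and the byte offset converted to a page offset.

#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/etc/hostname", O_RDONLY);       /* arbitrary file */
        void *addr;

        if (fd < 0)
                return 1;
        /* len=4096, prot=PROT_READ, flags=MAP_PRIVATE, offset=0 */
        addr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) {
                close(fd);
                return 1;
        }
        printf("file mapped at %p\n", addr);
        munmap(addr, 4096);
        close(fd);
        return 0;
}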
D.2.1.2 Function: do_mmap_pgoff() (mm/mmap.c)

This function is very large and so is broken up into a number of sections. Broadly speaking the sections are:

• Sanity check the parameters

• Find a free linear address space large enough for the memory mapping. If a filesystem or device specific get_unmapped_area() function is provided, it will be used, otherwise arch_get_unmapped_area() is called

• Calculate the VM flags and check them against the file access permissions

• If an old area exists where the mapping is to take place, fix it up so it is suitable for the new mapping

• Allocate a vm_area_struct from the slab allocator and fill in its entries

• Link in the new VMA

• Call the filesystem or device specific mmap() function

• Update statistics and exit
393 unsigned long do_mmap_pgoff(struct file * file, unsigned long addr,
394             unsigned long len, unsigned long prot,
                unsigned long flags, unsigned long pgoff)
395 {
396     struct mm_struct * mm = current->mm;
397     struct vm_area_struct * vma, * prev;
398     unsigned int vm_flags;
399     int correct_wcount = 0;
400     int error;
401     rb_node_t ** rb_link, * rb_parent;
402 
403     if (file && (!file->f_op || !file->f_op->mmap))
404         return -ENODEV;
405 
406     if (!len)
407         return addr;
408 
409     len = PAGE_ALIGN(len);
410     if (len > TASK_SIZE || len == 0)
            return -EINVAL;
413 
414     /* offset overflow? */
415     if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
416         return -EINVAL;
417 
418     /* Too many mappings? */
419     if (mm->map_count > max_map_count)
420         return -ENOMEM;
421 
393 The parameters which correspond directly to the parameters of the mmap system call are:

file   the struct file to mmap if this is a file backed mapping

addr   the requested address to map

len    the length in bytes to mmap

prot   is the permissions on the area

flags  are the flags for the mapping

pgoff  is the offset within the file to begin the mmap at
403-404 If a file or device is being mapped, make sure a filesystem or device specific mmap function is provided. For most filesystems, this will call generic_file_mmap() (See Section D.6.2.1)

406-407 Make sure a zero length mmap() is not requested

409 Ensure that the mapping is confined to the userspace portion of the address space. On the x86, kernel space begins at PAGE_OFFSET (3GiB)

415-416 Ensure the mapping will not overflow the end of the largest possible file size

419-420 Only max_map_count number of mappings are allowed. By default this value is DEFAULT_MAX_MAP_COUNT or 65536 mappings

422     /* Obtain the address to map to. we verify (or select) it and
423      * ensure that it represents a valid section of the address space.
424      */
425     addr = get_unmapped_area(file, addr, len, pgoff, flags);
426     if (addr & ~PAGE_MASK)
427         return addr;
428 

425 After basic sanity checks, this function will call the device or file specific get_unmapped_area() function. If a device specific one is unavailable, arch_get_unmapped_area() is called. This function is discussed in Section D.3.2.2
429     /* Do simple checking here so the lower-level routines won't have
430      * to. we assume access permissions have been handled by the open
431      * of the memory object, so we don't do any here.
432      */
433     vm_flags = calc_vm_flags(prot,flags) | mm->def_flags
434                 | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
435 
436     /* mlock MCL_FUTURE? */
437     if (vm_flags & VM_LOCKED) {
438         unsigned long locked = mm->locked_vm << PAGE_SHIFT;
439         locked += len;
440         if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur)
441             return -EAGAIN;
442     }
433 calc_vm_flags() takes the prot and flags from userspace and translates them to their VM_ equivalents

436-440 Check if it has been requested that all future mappings be locked in memory. If yes, make sure the process isn't locking more memory than it is allowed to. If it is, return -EAGAIN
443     if (file) {
444         switch (flags & MAP_TYPE) {
445         case MAP_SHARED:
446             if ((prot & PROT_WRITE) &&
                    !(file->f_mode & FMODE_WRITE))
447                 return -EACCES;
448 
449             /* Make sure we don't allow writing to
                   an append-only file.. */
450             if (IS_APPEND(file->f_dentry->d_inode) &&
                    (file->f_mode & FMODE_WRITE))
451                 return -EACCES;
452 
453             /* make sure there are no mandatory
                   locks on the file. */
454             if (locks_verify_locked(file->f_dentry->d_inode))
455                 return -EAGAIN;
456 
457             vm_flags |= VM_SHARED | VM_MAYSHARE;
458             if (!(file->f_mode & FMODE_WRITE))
459                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
460 
461             /* fall through */
462         case MAP_PRIVATE:
463             if (!(file->f_mode & FMODE_READ))
464                 return -EACCES;
465             break;
466 
467         default:
468             return -EINVAL;
469         }
443-470 If a file is being memory mapped, check the file's access permissions

446-447 If write access is requested, make sure the file is opened for write

450-451 Similarly, if the file is opened for append, make sure it cannot be written to. The prot field is not checked because the prot field applies only to the mapping whereas we need to check the opened file

453 If the file is mandatory locked, return -EAGAIN so the caller will try a second time

457-459 Fix up the flags to be consistent with the file flags

463-464 Make sure the file can be read before mmapping it
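The file-mode check at lines 446-448 is easy to observe from userspace. The program below (an illustration only; the file name is arbitrary) opens a file read-only and then asks for a shared writable mapping, which the kernel rejects with EACCES.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/etc/hostname", O_RDONLY);
        void *addr;

        if (fd < 0)
                return 1;
        addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                printf("mmap rejected: %s\n", strerror(errno));    /* EACCES */
        else
                munmap(addr, 4096);
        close(fd);
        return 0;
}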
470     } else {
471         vm_flags |= VM_SHARED | VM_MAYSHARE;
472         switch (flags & MAP_TYPE) {
473         default:
474             return -EINVAL;
475         case MAP_PRIVATE:
476             vm_flags &= ~(VM_SHARED | VM_MAYSHARE);
477             /* fall through */
478         case MAP_SHARED:
479             break;
480         }
481     }

471-481 If the mapping is for anonymous use, fix up the flags if the requested mapping is MAP_PRIVATE to make sure the flags are consistent
483     /* Clear old maps */
484 munmap_back:
485     vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
486     if (vma && vma->vm_start < addr + len) {
487         if (do_munmap(mm, addr, len))
488             return -ENOMEM;
489         goto munmap_back;
490     }

485 find_vma_prepare() (See Section D.2.2.2) steps through the RB tree for the VMA corresponding to a given address

486-488 If a VMA was found and it is part of the new mapping, remove the old mapping as the new one will cover both
491     /* Check against address space limit. */
492     if ((mm->total_vm << PAGE_SHIFT) + len
493         > current->rlim[RLIMIT_AS].rlim_cur)
494         return -ENOMEM;
495 
496     /* Private writable mapping? Check memory availability.. */
497     if ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
498         !(flags & MAP_NORESERVE) &&
499         !vm_enough_memory(len >> PAGE_SHIFT))
500         return -ENOMEM;
501 
502     /* Can we just expand an old anonymous mapping? */
503     if (!file && !(vm_flags & VM_SHARED) && rb_parent)
504         if (vma_merge(mm, prev, rb_parent,
                          addr, addr + len, vm_flags))
505             goto out;
506 
507 

493-495 Make sure the new mapping will not exceed the total VM a process is allowed to have. It is unclear why this check is not made earlier

498-501 If the caller does not specifically request that free space is not checked with MAP_NORESERVE and it is a private mapping, make sure enough memory is available to satisfy the mapping under current conditions

504-506 If two adjacent memory mappings are anonymous and can be treated as one, expand the old mapping rather than creating a new one
508     /* Determine the object being mapped and call the appropriate
509      * specific mapper. the address has already been validated, but
510      * not unmapped, but the maps are removed from the list.
511      */
512     vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
513     if (!vma)
514         return -ENOMEM;
515 
516     vma->vm_mm = mm;
517     vma->vm_start = addr;
518     vma->vm_end = addr + len;
519     vma->vm_flags = vm_flags;
520     vma->vm_page_prot = protection_map[vm_flags & 0x0f];
521     vma->vm_ops = NULL;
522     vma->vm_pgoff = pgoff;
523     vma->vm_file = NULL;
524     vma->vm_private_data = NULL;
525     vma->vm_raend = 0;

512 Allocate a vm_area_struct from the slab allocator

516-525 Fill in the basic vm_area_struct fields
527     if (file) {
528         error = -EINVAL;
529         if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
530             goto free_vma;
531         if (vm_flags & VM_DENYWRITE) {
532             error = deny_write_access(file);
533             if (error)
534                 goto free_vma;
535             correct_wcount = 1;
536         }
537         vma->vm_file = file;
538         get_file(file);
539         error = file->f_op->mmap(file, vma);
540         if (error)
541             goto unmap_and_free_vma;

527-542 Fill in the file related fields if this is a file being mapped

529-530 These are both invalid flags for a file mapping so free the vm_area_struct and return

531-536 This flag is cleared by the system call mmap() but is still cleared for kernel modules that call this function directly. Historically, -ETXTBUSY was returned to the calling process if the underlying file was being written to

537 Fill in the vm_file field

538 This increments the file usage count

539 Call the filesystem or device specific mmap() function. In many filesystem cases, this will call generic_file_mmap() (See Section D.6.2.1)

540-541 If an error occurred, goto unmap_and_free_vma to clean up and return the error
542     } else if (flags & MAP_SHARED) {
543         error = shmem_zero_setup(vma);
544         if (error)
545             goto free_vma;
546     }
547 

543 If this is an anonymous shared mapping, the region is created and set up by shmem_zero_setup() (See Section L.7.1). Anonymous shared pages are backed by a virtual tmpfs filesystem so that they can be synchronised properly with swap. The writeback function is shmem_writepage() (See Section L.6.1)
548     /* Can addr have changed??
549      *
550      * Answer: Yes, several device drivers can do it in their
551      *         f_op->mmap method. -DaveM
552      */
553     if (addr != vma->vm_start) {
554         /*
555          * It is a bit too late to pretend changing the virtual
556          * area of the mapping, we just corrupted userspace
557          * in the do_munmap, so FIXME (not in 2.4 to avoid
558          * breaking the driver API).
559          */
560         struct vm_area_struct * stale_vma;
561         /* Since addr changed, we rely on the mmap op to prevent
562          * collisions with existing vmas and just use
563          * find_vma_prepare to update the tree pointers.
564          */
565         addr = vma->vm_start;
566         stale_vma = find_vma_prepare(mm, addr, &prev,
567                         &rb_link, &rb_parent);
568         /*
569          * Make sure the lowlevel driver did its job right.
570          */
571         if (unlikely(stale_vma && stale_vma->vm_start <
                         vma->vm_end)) {
572             printk(KERN_ERR "buggy mmap operation: [<%p>]\n",
573                 file ? file->f_op->mmap : NULL);
574             BUG();
575         }
576     }
577 
578     vma_link(mm, vma, prev, rb_link, rb_parent);
579     if (correct_wcount)
580         atomic_inc(&file->f_dentry->d_inode->i_writecount);
581 
553-576 If the address has changed, it means the device specific mmap operation moved the VMA address to somewhere else. The function find_vma_prepare() (See Section D.2.2.2) is used to find where the VMA was moved to

578 Link in the new vm_area_struct

579-580 Update the file write count
582 out:
583     mm->total_vm += len >> PAGE_SHIFT;
584     if (vm_flags & VM_LOCKED) {
585         mm->locked_vm += len >> PAGE_SHIFT;
586         make_pages_present(addr, addr + len);
587     }
588     return addr;
589 
590 unmap_and_free_vma:
591     if (correct_wcount)
592         atomic_inc(&file->f_dentry->d_inode->i_writecount);
593     vma->vm_file = NULL;
594     fput(file);
595 
596     /* Undo any partial mapping done by a device driver. */
597     zap_page_range(mm, vma->vm_start, vma->vm_end - vma->vm_start);
598 free_vma:
599     kmem_cache_free(vm_area_cachep, vma);
600     return error;
601 }
583-588 Update statistics for the process mm_struct and return the new address
590-597 This is reached if the file has been partially mapped before failing. The write statistics are updated and then all user pages are removed with zap_page_range()
598-600 This goto is used if the mapping failed immediately after the vm_area_struct
is created. It is freed back to the slab allocator before the error is returned
D.2.2 Inserting a Memory Region
The call graph for insert_vm_struct() is shown in Figure 4.6.

D.2.2.1 Function: __insert_vm_struct() (mm/mmap.c)
This is the top level function for inserting a new vma into an address space. There is a second function like it called simply insert_vm_struct() that is not described in detail here as the only difference is the one line of code increasing the map_count.
1174 void __insert_vm_struct(struct mm_struct * mm,
                             struct vm_area_struct * vma)
1175 {
1176     struct vm_area_struct * __vma, * prev;
1177     rb_node_t ** rb_link, * rb_parent;

1179     __vma = find_vma_prepare(mm, vma->vm_start, &prev,
                     &rb_link, &rb_parent);
1180     if (__vma && __vma->vm_start < vma->vm_end)
1181         BUG();
1182     __vma_link(mm, vma, prev, rb_link, rb_parent);
1183     mm->map_count++;
1184     validate_mm(mm);
1185 }
1174 The arguments are the mm_struct that represents the linear address space and the vm_area_struct that is to be inserted
1179 find_vma_prepare() (See Section D.2.2.2) locates where the new VMA can be inserted. It will be inserted between prev and __vma and the required nodes for the red-black tree are also returned
1180-1181 This is a check to make sure the returned VMA is invalid. It is virtually impossible for this condition to occur without manually inserting bogus VMAs into the address space
1182 This function does the actual work of linking the vma struct into the linear linked list and the red-black tree
1183 Increase the map_count to show a new mapping has been added. This line is not present in insert_vm_struct()
1184 validate_mm() is a debugging macro for red-black trees. If DEBUG_MM_RB is set, the linear list of VMAs and the tree will be traversed to make sure it is valid. The tree traversal is a recursive function so it is very important that it is used only if really necessary as a large number of mappings could cause a stack overflow. If it is not set, validate_mm() does nothing at all

D.2.2.2 Function: find_vma_prepare() (mm/mmap.c)
This is responsible for finding the correct places to insert a VMA at the supplied address. It returns a number of pieces of information via the actual return and the function arguments. The forward VMA to link to is returned with return. pprev is the previous node which is required because the list is a singly linked list. rb_link and rb_parent are the parent and leaf node the new VMA will be inserted between.
246 static struct vm_area_struct * find_vma_prepare(struct mm_struct * mm,
                                        unsigned long addr,
247                                     struct vm_area_struct ** pprev,
248                                     rb_node_t *** rb_link,
                                        rb_node_t ** rb_parent)
249 {
250     struct vm_area_struct * vma;
251     rb_node_t ** __rb_link, * __rb_parent, * rb_prev;

253     __rb_link = &mm->mm_rb.rb_node;
254     rb_prev = __rb_parent = NULL;
255     vma = NULL;

257     while (*__rb_link) {
258         struct vm_area_struct *vma_tmp;

260         __rb_parent = *__rb_link;
261         vma_tmp = rb_entry(__rb_parent,
                        struct vm_area_struct, vm_rb);

263         if (vma_tmp->vm_end > addr) {
264             vma = vma_tmp;
265             if (vma_tmp->vm_start <= addr)
266                 return vma;
267             __rb_link = &__rb_parent->rb_left;
268         } else {
269             rb_prev = __rb_parent;
270             __rb_link = &__rb_parent->rb_right;
271         }
272     }

274     *pprev = NULL;
275     if (rb_prev)
276         *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
277     *rb_link = __rb_link;
278     *rb_parent = __rb_parent;
279     return vma;
280 }
246 The function arguments are described above
253-255 Initialise the search
263-272 This is a similar tree walk to what was described for find_vma(). The only real difference is that the nodes last traversed are remembered with the __rb_link and __rb_parent variables
275-276 Get the back linking VMA via the red-black tree
279 Return the forward linking VMA
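The search pattern above can be illustrated outside the kernel. The following standalone sketch (not kernel code, names and the plain unbalanced tree are assumptions for illustration only; the kernel uses an rbtree) walks an ordered tree of ranges and returns the first node ending after addr, the last node passed on a right turn (the "previous" node) and the link pointer where a new node would be attached, mirroring what find_vma_prepare() hands back to its callers.

#include <stdio.h>
#include <stdlib.h>

struct range {
    unsigned long start, end;        /* covers [start, end) */
    struct range *left, *right;
};

/* Returns the first node ending after addr; fills in prev and the
 * insertion link, just like find_vma_prepare() does for VMAs. */
static struct range *find_prepare(struct range **rootp, unsigned long addr,
                                  struct range **prev, struct range ***link)
{
    struct range **walk = rootp;
    struct range *found = NULL;

    *prev = NULL;
    while (*walk) {
        struct range *node = *walk;

        if (node->end > addr) {
            found = node;           /* candidate forward node */
            if (node->start <= addr)
                break;              /* addr falls inside this node */
            walk = &node->left;
        } else {
            *prev = node;           /* remember the last right turn */
            walk = &node->right;
        }
    }
    *link = walk;                   /* where a new node would hang */
    return found;
}

static void insert(struct range **rootp, struct range *n)
{
    struct range **walk = rootp;

    while (*walk)
        walk = (n->start < (*walk)->start) ? &(*walk)->left
                                           : &(*walk)->right;
    *walk = n;
}

int main(void)
{
    struct range a = { 0x1000, 0x2000, NULL, NULL };
    struct range b = { 0x4000, 0x6000, NULL, NULL };
    struct range c = { 0x8000, 0x9000, NULL, NULL };
    struct range *root = NULL, *prev, **link, *next;

    insert(&root, &b);
    insert(&root, &a);
    insert(&root, &c);

    next = find_prepare(&root, 0x3000, &prev, &link);
    /* For 0x3000: next is [0x4000,0x6000), prev is [0x1000,0x2000) */
    printf("next starts at %#lx, prev ends at %#lx\n",
           next ? next->start : 0, prev ? prev->end : 0);
    return 0;
}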
D.2.2.3 Function: vma_link() (mm/mmap.c)
This is the top-level function for linking a VMA into the proper lists. It is responsible for acquiring the necessary locks to make a safe insertion.
337 static inline void vma_link(struct mm_struct * mm,
                        struct vm_area_struct * vma,
                        struct vm_area_struct * prev,
338                     rb_node_t ** rb_link, rb_node_t * rb_parent)
339 {
340     lock_vma_mappings(vma);
341     spin_lock(&mm->page_table_lock);
342     __vma_link(mm, vma, prev, rb_link, rb_parent);
343     spin_unlock(&mm->page_table_lock);
344     unlock_vma_mappings(vma);

346     mm->map_count++;
347     validate_mm(mm);
348 }
337 mm is the address space the VMA is to be inserted into. prev is the backwards linked VMA for the linear linked list of VMAs. rb_link and rb_parent are the nodes required to make the rb insertion
340 This function acquires the spinlock protecting the address_space representing the file that is being memory mapped
341 Acquire the page table lock which protects the whole mm_struct
342 Insert the VMA
343 Free the lock protecting the mm_struct
345 Unlock the address_space for the file
346 Increase the number of mappings in this mm
347 If DEBUG_MM_RB is set, the RB trees and linked lists will be checked to make sure they are still valid
D.2.2.4 Function: __vma_link() (mm/mmap.c)
This simply calls three helper functions which are responsible for linking the
VMA into the three linked lists that link VMAs together.
329 static void __vma_link(struct mm_struct * mm,
                        struct vm_area_struct * vma,
                        struct vm_area_struct * prev,
330                     rb_node_t ** rb_link, rb_node_t * rb_parent)
331 {
332     __vma_link_list(mm, vma, prev, rb_parent);
333     __vma_link_rb(mm, vma, rb_link, rb_parent);
334     __vma_link_file(vma);
335 }
332 This links the VMA into the linear linked list of VMAs in this mm via the vm_next field
333 This links the VMA into the red-black tree of VMAs in this mm whose root is stored in the vm_rb field
334 This links the VMA into the shared mapping VMA links. Memory mapped files are linked together over potentially many mms by this function via the vm_next_share and vm_pprev_share fields

D.2.2.5 Function: __vma_link_list() (mm/mmap.c)
282 static inline void __vma_link_list(struct mm_struct * mm,
                        struct vm_area_struct * vma,
                        struct vm_area_struct * prev,
283                     rb_node_t * rb_parent)
284 {
285     if (prev) {
286         vma->vm_next = prev->vm_next;
287         prev->vm_next = vma;
288     } else {
289         mm->mmap = vma;
290         if (rb_parent)
291             vma->vm_next = rb_entry(rb_parent,
                            struct vm_area_struct,
                            vm_rb);
292         else
293             vma->vm_next = NULL;
294     }
295 }
285 If prev is not null, the vma is simply inserted into the list
289 Else this is the first mapping and the first element of the list has to be stored in the mm_struct
290 The VMA is stored as the parent node
D.2.2.6 Function: __vma_link_rb() (mm/mmap.c)
The principal workings of this function are stored within <linux/rbtree.h> and will not be discussed in detail in this book.
297 static inline void __vma_link_rb(struct mm_struct * mm,
                        struct vm_area_struct * vma,
298                     rb_node_t ** rb_link,
                        rb_node_t * rb_parent)
299 {
300     rb_link_node(&vma->vm_rb, rb_parent, rb_link);
301     rb_insert_color(&vma->vm_rb, &mm->mm_rb);
302 }
D.2.2.7 Function: __vma_link_file() (mm/mmap.c)
This function links the VMA into a linked list of shared file mappings.
304 static inline void __vma_link_file(struct vm_area_struct * vma)
305 {
306     struct file * file;

308     file = vma->vm_file;
309     if (file) {
310         struct inode * inode = file->f_dentry->d_inode;
311         struct address_space *mapping = inode->i_mapping;
312         struct vm_area_struct **head;

314         if (vma->vm_flags & VM_DENYWRITE)
315             atomic_dec(&inode->i_writecount);

317         head = &mapping->i_mmap;
318         if (vma->vm_flags & VM_SHARED)
319             head = &mapping->i_mmap_shared;

321         /* insert vma into inode's share list */
322         if((vma->vm_next_share = *head) != NULL)
323             (*head)->vm_pprev_share = &vma->vm_next_share;
324         *head = vma;
325         vma->vm_pprev_share = head;
326     }
327 }
309 Check to see if this VMA has a shared file mapping. If it does not, this function has nothing more to do
310-312 Extract the relevant information about the mapping from the VMA
314-315 If this mapping is not allowed to write even if the permissions are ok for writing, decrement the i_writecount field. A negative value in this field indicates that the file is memory mapped and may not be written to. Efforts to open the file for writing will now fail
317-319 Check to make sure this is a shared mapping
322-325 Insert the VMA into the shared mapping linked list
D.2.3 Merging Contiguous Regions
D.2.3.1 Function: vma_merge() (mm/mmap.c)
This function checks to see if the region pointed to by prev may be expanded forwards to cover the area from addr to end instead of allocating a new VMA. If it cannot, the VMA ahead is checked to see if it can be expanded backwards instead.
350 static int vma_merge(struct mm_struct * mm,
                        struct vm_area_struct * prev,
351                     rb_node_t * rb_parent,
                        unsigned long addr, unsigned long end,
                        unsigned long vm_flags)
352 {
353     spinlock_t * lock = &mm->page_table_lock;
354     if (!prev) {
355         prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
356         goto merge_next;
357     }
350 The parameters are as follows:
mm The mm the VMAs belong to
prev The VMA before the address we are interested in
rb_parent The parent RB node as returned by find_vma_prepare()
addr The starting address of the region to be merged
end The end of the region to be merged
vm_flags The permission flags of the region to be merged
353 This is the lock to the mm
354-357 If prev is not passed in, it is taken to mean that the VMA being tested for merging is in front of the region from addr to end. The entry for that VMA is extracted from the rb_parent

358     if (prev->vm_end == addr && can_vma_merge(prev, vm_flags)) {
359         struct vm_area_struct * next;

361         spin_lock(lock);
362         prev->vm_end = end;
363         next = prev->vm_next;
364         if (next && prev->vm_end == next->vm_start &&
                        can_vma_merge(next, vm_flags)) {
365             prev->vm_end = next->vm_end;
366             __vma_unlink(mm, next, prev);
367             spin_unlock(lock);

369             mm->map_count--;
370             kmem_cache_free(vm_area_cachep, next);
371             return 1;
372         }
373         spin_unlock(lock);
374         return 1;
375     }

377     prev = prev->vm_next;
378     if (prev) {
379 merge_next:
380         if (!can_vma_merge(prev, vm_flags))
381             return 0;
382         if (end == prev->vm_start) {
383             spin_lock(lock);
384             prev->vm_start = addr;
385             spin_unlock(lock);
386             return 1;
387         }
388     }

390     return 0;
391 }
358-375 Check to see if the region pointed to by prev may be expanded to cover the current region
358 The function can_vma_merge() checks the permissions of prev with those in vm_flags and that the VMA has no file mappings (i.e. it is anonymous). If it is true, the area at prev may be expanded
361 Lock the mm
362 Expand the end of the VMA region (vm_end) to the end of the new mapping
(end)
363 next is now the VMA in front of the newly expanded VMA
364 Check if the expanded region can be merged with the VMA in front of it
365 If it can, continue to expand the region to cover the next VMA
366 As a VMA has been merged, one region is now defunct and may be unlinked
367 No further adjustments are made to the mm struct so the lock is released
369 There is one less mapped region to reduce the map_count
370 Delete the struct describing the merged VMA
371 Return success
377 If this line is reached, it means the region pointed to by prev could not be expanded forward so a check is made to see if the region ahead can be merged backwards instead
382-388 Same idea as the above block except instead of adjusting vm_end to cover end, vm_start is expanded to cover addr

D.2.3.2 Function: can_vma_merge() (include/linux/mm.h)
This trivial function checks to see if the permissions of the supplied VMA match the permissions in vm_flags.
582 static inline int can_vma_merge(struct vm_area_struct * vma,
                        unsigned long vm_flags)
583 {
584     if (!vma->vm_file && vma->vm_flags == vm_flags)
585         return 1;
586     else
587         return 0;
588 }
584 Self explanatory. Return true if there is no file/device mapping (i.e. it is anonymous) and the VMA flags for both regions match
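The merging conditions can be observed from userspace. The following hedged sketch (not from the book) re-maps the second half of an anonymous region with identical protection and flags; because both halves are anonymous and their flags match, the kernel is free to represent them as a single entry, which is typically visible in /proc/self/maps. The MAP_FIXED call is only safe here because the address range was just owned by this process.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4 * 4096;
    char *base, *second, cmd[64];

    base = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Punch out the second half, then map it again with the same flags */
    munmap(base + len, len);
    second = mmap(base + len, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (second == MAP_FAILED) {
        perror("mmap MAP_FIXED");
        return EXIT_FAILURE;
    }

    /* The two adjacent anonymous regions usually show up as one VMA */
    snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", getpid());
    return system(cmd);
}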
D.2.4 Remapping and Moving a Memory Region
D.2.4.1 Function: sys_mremap() (mm/mremap.c)
The call graph for this function is shown in Figure 4.7. This is the system service call to remap a memory region.
347 asmlinkage unsigned long sys_mremap(unsigned long addr,
348     unsigned long old_len, unsigned long new_len,
349     unsigned long flags, unsigned long new_addr)
350 {
351     unsigned long ret;

353     down_write(&current->mm->mmap_sem);
354     ret = do_mremap(addr, old_len, new_len, flags, new_addr);
355     up_write(&current->mm->mmap_sem);
356     return ret;
357 }
347-349 The parameters are the same as those described in the mremap() man page
353 Acquire the mm semaphore
354 do_mremap() (See Section D.2.4.2) is the top level function for remapping a region
355 Release the mm semaphore
356 Return the status of the remapping
D.2.4.2 Function: do_mremap() (mm/mremap.c)
This function does most of the actual work required to remap, resize and move a memory region. It is quite long but can be broken up into distinct parts which will be dealt with separately here. The tasks are, broadly speaking:
• Check usage flags and page align lengths
• Handle the condition where MREMAP_FIXED is set and the region is being moved to a new location
• If a region is shrinking, allow it to happen unconditionally
• If the region is growing or moving, perform a number of checks in advance to make sure the move is allowed and safe
• Handle the case where the region is being expanded and cannot be moved
• Finally handle the case where the region has to be resized and moved
219 unsigned long do_mremap(unsigned long addr,
220     unsigned long old_len, unsigned long new_len,
221     unsigned long flags, unsigned long new_addr)
222 {
223     struct vm_area_struct *vma;
224     unsigned long ret = -EINVAL;

226     if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
227         goto out;

229     if (addr & ~PAGE_MASK)
230         goto out;

232     old_len = PAGE_ALIGN(old_len);
233     new_len = PAGE_ALIGN(new_len);
219 The parameters of the function are
addr is the old starting address
old_len is the old region length
new_len is the new region length
flags is the option flags passed. If MREMAP_MAYMOVE is specified, it means that the region is allowed to move if there is not enough linear address space at the current space. If MREMAP_FIXED is specified, it means that the whole region is to move to the specified new_addr with the new length. The area from new_addr to new_addr+new_len will be unmapped with do_munmap().
new_addr is the address of the new region if it is moved
224 At this point, the default return is -EINVAL for invalid arguments
226-227 Make sure flags other than the two allowed flags are not used
229-230 The address passed in must be page aligned
232-233 Page align the passed region lengths
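As a point of reference for the flags described above, the following is a minimal userspace sketch (not from the book) that grows an anonymous mapping with mremap(). With MREMAP_MAYMOVE, the kernel may relocate the region if it cannot be expanded in place, which is exactly the decision do_mremap() is making in the code that follows.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t old_len = 4096, new_len = 16 * 4096;
    char *region, *bigger;

    region = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }
    strcpy(region, "contents survive the resize");

    /* Let the kernel move the region if it cannot grow in place */
    bigger = mremap(region, old_len, new_len, MREMAP_MAYMOVE);
    if (bigger == MAP_FAILED) {
        perror("mremap");
        return EXIT_FAILURE;
    }

    printf("old address %p, new address %p, data: %s\n",
           (void *)region, (void *)bigger, bigger);
    munmap(bigger, new_len);
    return 0;
}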
236     if (flags & MREMAP_FIXED) {
237         if (new_addr & ~PAGE_MASK)
238             goto out;
239         if (!(flags & MREMAP_MAYMOVE))
240             goto out;

242         if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
243             goto out;

245         /* Check if the location we're moving into overlaps the
246          * old location at all, and fail if it does.
247          */
248         if ((new_addr <= addr) && (new_addr+new_len) > addr)
249             goto out;

251         if ((addr <= new_addr) && (addr+old_len) > new_addr)
252             goto out;

254         do_munmap(current->mm, new_addr, new_len);
255     }
This block handles the condition where the region location is fixed and must be fully moved. It ensures the area being moved to is safe and definitely unmapped.
236 MREMAP_FIXED is the flag which indicates the location is fixed
237-238 The specified new_addr must be page aligned
239-240 If MREMAP_FIXED is specified, then the MAYMOVE flag must be used as well
242-243 Make sure the resized region does not exceed TASK_SIZE
248-249 Just as the comments indicate, the two regions being used for the move may not overlap
254 Unmap the region that is about to be used. It is presumed the caller ensures that the region is not in use for anything important
261     ret = addr;
262     if (old_len >= new_len) {
263         do_munmap(current->mm, addr+new_len, old_len - new_len);
264         if (!(flags & MREMAP_FIXED) || (new_addr == addr))
265             goto out;
266     }
261 At this point, the address of the resized region is the return value
262 If the old length is larger than the new length, then the region is shrinking
263 Unmap the unused region
264-265 If the region is not to be moved, either because MREMAP_FIXED is not used or the new address matches the old address, goto out which will return the address
271     ret = -EFAULT;
272     vma = find_vma(current->mm, addr);
273     if (!vma || vma->vm_start > addr)
274         goto out;
275     /* We can't remap across vm area boundaries */
276     if (old_len > vma->vm_end - addr)
277         goto out;
278     if (vma->vm_flags & VM_DONTEXPAND) {
279         if (new_len > old_len)
280             goto out;
281     }
282     if (vma->vm_flags & VM_LOCKED) {
283         unsigned long locked = current->mm->locked_vm << PAGE_SHIFT;
284         locked += new_len - old_len;
285         ret = -EAGAIN;
286         if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur)
287             goto out;
288     }
289     ret = -ENOMEM;
290     if ((current->mm->total_vm << PAGE_SHIFT) + (new_len - old_len)
291             > current->rlim[RLIMIT_AS].rlim_cur)
292         goto out;
293     /* Private writable mapping? Check memory availability.. */
294     if ((vma->vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
295             !(flags & MAP_NORESERVE) &&
296             !vm_enough_memory((new_len - old_len) >> PAGE_SHIFT))
297         goto out;
Do a number of checks to make sure it is safe to grow or move the region
271 At this point, the default action is to return -EFAULT causing a segmentation fault as the ranges of memory being used are invalid
272 Find the VMA responsible for the requested address
273 If the returned VMA is not responsible for this address, then an invalid address was used so return a fault
276-277 If the old_len passed in exceeds the length of the VMA, it means the user is trying to remap multiple regions which is not allowed
278-281 If the VMA has been explicitly marked as non-resizable, raise a fault
282-283 If the pages for this VMA must be locked in memory, recalculate the number of locked pages that will be kept in memory. If the number of pages exceed the ulimit set for this resource, return EAGAIN indicating to the caller that the region is locked and cannot be resized
289 The default return at this point is to indicate there is not enough memory
290-292 Ensure that the user will not exceed their allowed allocation of memory
294-297 Ensure that there is enough memory to satisfy the request after the resizing with vm_enough_memory() (See Section M.1.1)

302     if (old_len == vma->vm_end - addr &&
303             !((flags & MREMAP_FIXED) && (addr != new_addr)) &&
304             (old_len != new_len || !(flags & MREMAP_MAYMOVE))) {
305         unsigned long max_addr = TASK_SIZE;
306         if (vma->vm_next)
307             max_addr = vma->vm_next->vm_start;
308         /* can we just expand the current mapping? */
309         if (max_addr - addr >= new_len) {
310             int pages = (new_len - old_len) >> PAGE_SHIFT;
311             spin_lock(&vma->vm_mm->page_table_lock);
312             vma->vm_end = addr + new_len;
313             spin_unlock(&vma->vm_mm->page_table_lock);
314             current->mm->total_vm += pages;
315             if (vma->vm_flags & VM_LOCKED) {
316                 current->mm->locked_vm += pages;
317                 make_pages_present(addr + old_len,
318                         addr + new_len);
319             }
320             ret = addr;
321             goto out;
322         }
323     }
Handle the case where the region is being expanded and cannot be moved
302 If it is the full region that is being remapped and ...
303 The region is definitely not being moved and ...
304 The region is being expanded and cannot be moved then ...
305 Set the maximum address that can be used to TASK_SIZE, 3GiB on an x86
306-307 If there is another region, set the max address to be the start of the next
region
309-322 Only allow the expansion if the newly sized region does not overlap with
the next VMA
310 Calculate the number of extra pages that will be required
311 Lock the mm spinlock
312 Expand the VMA
313 Free the mm spinlock
314 Update the statistics for the mm
315-319 If the pages for this region are locked in memory, make them present now
320-321 Return the address of the resized region
329     ret = -ENOMEM;
330     if (flags & MREMAP_MAYMOVE) {
331         if (!(flags & MREMAP_FIXED)) {
332             unsigned long map_flags = 0;
333             if (vma->vm_flags & VM_SHARED)
334                 map_flags |= MAP_SHARED;

336             new_addr = get_unmapped_area(vma->vm_file, 0,
                        new_len, vma->vm_pgoff, map_flags);
337             ret = new_addr;
338             if (new_addr & ~PAGE_MASK)
339                 goto out;
340         }
341         ret = move_vma(vma, addr, old_len, new_len, new_addr);
342     }
343 out:
344     return ret;
345 }
To expand the region, a new one has to be allocated and the old one moved to it
329 The default action is to return saying no memory is available
330 Check to make sure the region is allowed to move
331 If MREMAP_FIXED is not specified, it means the new location was not supplied so one must be found
333-334 Preserve the MAP_SHARED option
336 Find an unmapped region of memory large enough for the expansion
337 The return value is the address of the new region
338-339 For the returned address to be not page aligned, get_unmapped_area() would need to be broken. This could possibly be the case with a buggy device driver implementing get_unmapped_area() incorrectly
341 Call move_vma to move the region
343-344 Return the address if successful and the error code otherwise
D.2.4.3 Function: move_vma() (mm/mremap.c)
The call graph for this function is shown in Figure 4.8. This function is responsible for moving all the page table entries from one VMA to another region. If necessary a new VMA will be allocated for the region being moved to. Just like the function above, it is very long but may be broken up into the following distinct parts.
• Function preamble, find the VMA preceding the area about to be moved to and the VMA in front of the region to be mapped
• Handle the case where the new location is between two existing VMAs. See if the preceding region can be expanded forward or the next region expanded backwards to cover the new mapped region
• Handle the case where the new location is going to be the last VMA on the list. See if the preceding region can be expanded forward
• If a region could not be expanded, allocate a new VMA from the slab allocator
• Call move_page_tables(), fill in the new VMA details if a new one was allocated and update statistics before returning
125 static inline unsigned long move_vma(struct vm_area_struct * vma,
126     unsigned long addr, unsigned long old_len, unsigned long new_len,
127     unsigned long new_addr)
128 {
129     struct mm_struct * mm = vma->vm_mm;
130     struct vm_area_struct * new_vma, * next, * prev;
131     int allocated_vma;

133     new_vma = NULL;
134     next = find_vma_prev(mm, new_addr, &prev);
125-127 The parameters are
vma The VMA that the address being moved belongs to
addr The starting address of the moving region
old_len The old length of the region to move
new_len The new length of the region moved
new_addr The new address to relocate to
134 Find the VMA preceding the address being moved to indicated by prev and return the region after the new mapping as next

135     if (next) {
136         if (prev && prev->vm_end == new_addr &&
137                 can_vma_merge(prev, vma->vm_flags) &&
                    !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
138             spin_lock(&mm->page_table_lock);
139             prev->vm_end = new_addr + new_len;
140             spin_unlock(&mm->page_table_lock);
141             new_vma = prev;
142             if (next != prev->vm_next)
143                 BUG();
144             if (prev->vm_end == next->vm_start &&
                        can_vma_merge(next, prev->vm_flags)) {
145                 spin_lock(&mm->page_table_lock);
146                 prev->vm_end = next->vm_end;
147                 __vma_unlink(mm, next, prev);
148                 spin_unlock(&mm->page_table_lock);

150                 mm->map_count--;
151                 kmem_cache_free(vm_area_cachep, next);
152             }
153         } else if (next->vm_start == new_addr + new_len &&
154                 can_vma_merge(next, vma->vm_flags) &&
                    !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
155             spin_lock(&mm->page_table_lock);
156             next->vm_start = new_addr;
157             spin_unlock(&mm->page_table_lock);
158             new_vma = next;
159         }
160     } else {
In this block, the new location is between two existing VMAs. Checks are made to see if the preceding region can be expanded to cover the new mapping and then if it can be expanded to cover the next VMA as well. If it cannot be expanded, the next region is checked to see if it can be expanded backwards.
136-137 If the preceding region touches the address to be mapped to and may be
merged then enter this block which will attempt to expand regions
138 Lock the mm
139 Expand the preceding region to cover the new location
140 Unlock the mm
141 The new VMA is now the preceding VMA which was just expanded
142-143 Make sure the VMA linked list is intact.
It would require a device driver
with severe brain damage to cause this situation to occur
144 Check if the region can be expanded forward to encompass the next region
145 If it can, then lock the mm
146 Expand the VMA further to cover the next VMA
147 There is now an extra VMA so unlink it
148 Unlock the mm
150 There is one less mapping now so update the map_count
151 Free the memory used by the memory mapping
153 Else the region pointed to by prev could not be expanded forward so check if the region next may be expanded backwards to cover the new mapping instead
155 If it can, lock the mm
156 Expand the mapping backwards
157 Unlock the mm
158 The VMA representing the new mapping is now next
161         prev = find_vma(mm, new_addr-1);
162         if (prev && prev->vm_end == new_addr &&
163                 can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
                    !(vma->vm_flags & VM_SHARED)) {
164             spin_lock(&mm->page_table_lock);
165             prev->vm_end = new_addr + new_len;
166             spin_unlock(&mm->page_table_lock);
167             new_vma = prev;
168         }
169     }
This block is for the case where the newly mapped region is the last VMA (next is NULL) so a check is made to see if the preceding region can be expanded.
161 Get the previously mapped region
162-163 Check if the regions may be mapped
164 Lock the mm
165 Expand the preceding region to cover the new mapping
166 Unlock the mm
167 The VMA representing the new mapping is now prev
171     allocated_vma = 0;
172     if (!new_vma) {
173         new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
174         if (!new_vma)
175             goto out;
176         allocated_vma = 1;
177     }
171 Set a flag indicating if a new VMA was not allocated
172 If a VMA has not been expanded to cover the new mapping then...
173 Allocate a new VMA from the slab allocator
174-175 If it could not be allocated, goto out to return failure
176 Set the flag indicating a new VMA was allocated
179     if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
180         unsigned long vm_locked = vma->vm_flags & VM_LOCKED;

182         if (allocated_vma) {
183             *new_vma = *vma;
184             new_vma->vm_start = new_addr;
185             new_vma->vm_end = new_addr+new_len;
186             new_vma->vm_pgoff +=
                    (addr-vma->vm_start) >> PAGE_SHIFT;
187             new_vma->vm_raend = 0;
188             if (new_vma->vm_file)
189                 get_file(new_vma->vm_file);
190             if (new_vma->vm_ops && new_vma->vm_ops->open)
191                 new_vma->vm_ops->open(new_vma);
192             insert_vm_struct(current->mm, new_vma);
193         }
194         do_munmap(current->mm, addr, old_len);

197         current->mm->total_vm += new_len >> PAGE_SHIFT;
198         if (new_vma->vm_flags & VM_LOCKED) {
199             current->mm->locked_vm += new_len >> PAGE_SHIFT;
200             make_pages_present(new_vma->vm_start,
201                     new_vma->vm_end);
202         }
203         return new_addr;
204     }
205     if (allocated_vma)
206         kmem_cache_free(vm_area_cachep, new_vma);
207 out:
208     return -ENOMEM;
209 }
179 move_page_tables() (See Section D.2.4.6) is responsible for copying all the page table entries. It returns 0 on success
182-193 If a new VMA was allocated, fill in all the relevant details, including the file/device entries and insert it into the various VMA linked lists with insert_vm_struct() (See Section D.2.2.1)
194 Unmap the old region as it is no longer required
197 Update the total_vm size for this process. The size of the old region is not important as it is handled within do_munmap()
198-202 If the VMA has the VM_LOCKED flag, all the pages within the region are made present with make_pages_present()
203 Return the address of the new region
205-206 This is the error path. If a VMA was allocated, delete it
208 Return an out of memory error
D.2.4.4 Function: make_pages_present() (mm/memory.c)
This function makes all pages between addr and end present. It assumes that the two addresses are within the one VMA.
1460 int make_pages_present(unsigned long addr, unsigned long end)
1461 {
1462     int ret, len, write;
1463     struct vm_area_struct * vma;

1465     vma = find_vma(current->mm, addr);
1466     write = (vma->vm_flags & VM_WRITE) != 0;
1467     if (addr >= end)
1468         BUG();
1469     if (end > vma->vm_end)
1470         BUG();
1471     len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
1472     ret = get_user_pages(current, current->mm, addr,
1473             len, write, 0, NULL, NULL);
1474     return ret == len ? 0 : -1;
1475 }
1465 Find the VMA with find_vma()(See Section D.3.1.1) that contains the starting address
1466 Record if write-access is allowed in write
1467-1468 If the starting address is after the end address, then BUG()
1469-1470 If the range spans more than one VMA it is a bug
1471 Calculate the length of the region to fault in
1472 Call get_user_pages() to fault in all the pages in the requested region. It returns the number of pages that were faulted in
1474 Return true if all the requested pages were successfully faulted in
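The user-visible effect of this pre-faulting can be seen from userspace with mlock(), whose VM_LOCKED handling relies on the same mechanism discussed here. The following hedged sketch (not from the book; it assumes the caller is permitted to lock this much memory, see RLIMIT_MEMLOCK) reads the resident set size from /proc/self/statm before and after locking a region.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Second field of /proc/self/statm is the resident set, in pages */
static long resident_pages(void)
{
    long size = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");

    if (f) {
        fscanf(f, "%ld %ld", &size, &resident);
        fclose(f);
    }
    return resident;
}

int main(void)
{
    size_t len = 64 * 4096;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    printf("resident before mlock: %ld pages\n", resident_pages());
    if (mlock(buf, len) != 0) {
        perror("mlock");
        return EXIT_FAILURE;
    }
    /* Locking forces the pages of the region to be faulted in */
    printf("resident after  mlock: %ld pages\n", resident_pages());

    munlock(buf, len);
    munmap(buf, len);
    return 0;
}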
D.2.4.5 Function: get_user_pages() (mm/memory.c)
This function is used to fault in user pages and may be used to fault in pages belonging to another process, which is required by ptrace() for example.
454 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                        unsigned long start,
455     int len, int write, int force, struct page **pages,
                        struct vm_area_struct **vmas)
456 {
457     int i;
458     unsigned int flags;

460     /*
461      * Require read or write permissions.
462      * If 'force' is set, we only require the "MAY" flags.
463      */
464     flags = write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
465     flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
466     i = 0;
454 The parameters are:
tsk is the process that pages are being faulted for
mm is the mm_struct managing the address space being faulted
start is where to start faulting
len is the length of the region, in pages, to fault
write indicates if the pages are being faulted for writing
force indicates that the pages should be faulted even if the region only has the VM_MAYREAD or VM_MAYWRITE flags
pages is an array of struct pages which may be NULL. If supplied, the array will be filled with struct pages that were faulted in
vmas is similar to the pages array. If supplied, it will be filled with VMAs that were affected by the faults
464 Set the required flags to VM_WRITE and VM_MAYWRITE flags if the parameter write is set to 1. Otherwise use the read equivalents
465 If force is specified, only require the MAY flags
468     do {
469         struct vm_area_struct * vma;

471         vma = find_extend_vma(mm, start);
472         if ( !vma ||
473             (pages && vma->vm_flags & VM_IO) ||
474             !(flags & vma->vm_flags) )
475             return i ? : -EFAULT;
476         spin_lock(&mm->page_table_lock);
477         do {
478             struct page *map;
479             while (!(map = follow_page(mm, start, write))) {
480                 spin_unlock(&mm->page_table_lock);
481                 switch (handle_mm_fault(mm, vma, start, write)) {
482                 case 1:
483                     tsk->min_flt++;
484                     break;
485                 case 2:
486                     tsk->maj_flt++;
487                     break;
488                 case 0:
489                     if (i) return i;
490                     return -EFAULT;
491                 default:
492                     if (i) return i;
493                     return -ENOMEM;
494                 }
495                 spin_lock(&mm->page_table_lock);
496             }
497             if (pages) {
498                 pages[i] = get_page_map(map);
499                 /* FIXME: call the correct function,
500                  * depending on the type of the found page
501                  */
502                 if (!pages[i])
503                     goto bad_page;
504                 page_cache_get(pages[i]);
505             }
506             if (vmas)
507                 vmas[i] = vma;
508             i++;
509             start += PAGE_SIZE;
510             len--;
511         } while(len && start < vma->vm_end);
512         spin_unlock(&mm->page_table_lock);
513     } while(len);
514 out:
515     return i;
468-513 This outer loop will move through every VMA aected by the faults
471 Find the VMA affected by the current value of start. This variable is incremented in PAGE_SIZEd strides
473 If a VMA does not exist for the address, or the caller has requested struct pages for a region that is IO mapped (and therefore not backed by physical memory) or that the VMA does not have the required flags, then return -EFAULT
476 Lock the page table spinlock
479-496 follow_page() (See Section C.2.1) walks the page tables and returns the struct page representing the frame mapped at start. This loop will only be entered if the PTE is not present and will keep looping until the PTE is known to be present with the page table spinlock held
480 Unlock the page table spinlock as handle_mm_fault() is likely to sleep
481 If the page is not present, fault it in with handle_mm_fault()(See Section D.5.3.1)
482-487 Update the task_struct statistics indicating if a major or minor fault occurred
488-490 If the faulting address is invalid, return
491-493 If the system is out of memory, return -ENOMEM
495 Relock the page tables.
The loop will check to make sure the page is actually
present
497-505 If the caller requested it, populate the pages array with struct pages affected by this function. Each struct will have a reference to it taken with page_cache_get()
506-507 Similarly, record VMAs aected
508 Increment i which is a counter for the number of pages present in the requested
region
509 Increment start in a page-sized stride
510 Decrement the number of pages that must be faulted in
511 Keep moving through the VMAs until the requested pages have been faulted
in
512 Release the page table spinlock
515 Return the number of pages known to be present in the region
517     /*
518      * We found an invalid page in the VMA. Release all we have
519      * so far and fail.
520      */
521 bad_page:
522     spin_unlock(&mm->page_table_lock);
523     while (i--)
524         page_cache_release(pages[i]);
525     i = -EFAULT;
526     goto out;
527 }
521 This will only be reached if a struct page is found which represents a non-existent page frame
523-524 If one is found, release references to all pages stored in the pages array
525-526 Return -EFAULT
D.2.4.6 Function: move_page_tables() (mm/mremap.c)
The call graph for this function is shown in Figure 4.9. This function is responsible for copying all the page table entries from the region pointed to by old_addr to new_addr. It works by literally copying page table entries one at a time. When it is finished, it deletes all the entries from the old area. This is not the most efficient way to perform the operation, but it is very easy to recover from errors.
 90 static int move_page_tables(struct mm_struct * mm,
 91     unsigned long new_addr, unsigned long old_addr,
        unsigned long len)
 92 {
 93     unsigned long offset = len;

 95     flush_cache_range(mm, old_addr, old_addr + len);

102     while (offset) {
103         offset -= PAGE_SIZE;
104         if (move_one_page(mm, old_addr + offset, new_addr +
                offset))
105             goto oops_we_failed;
106     }
107     flush_tlb_range(mm, old_addr, old_addr + len);
108     return 0;

117 oops_we_failed:
118     flush_cache_range(mm, new_addr, new_addr + len);
119     while ((offset += PAGE_SIZE) < len)
120         move_one_page(mm, new_addr + offset, old_addr + offset);
121     zap_page_range(mm, new_addr, len);
122     return -1;
123 }
90 The parameters are the mm for the process, the new location, the old location
and the length of the region to move entries for
95 flush_cache_range() will flush all CPU caches for this range. It must be called first as some architectures, notably Sparc, require that a virtual to physical mapping exist before flushing the TLB
102-106 This loops through each page in the region and moves the PTE with move_one_page() (See Section D.2.4.7). This translates to a lot of page table walking and could be performed much better but it is a rare operation
107 Flush the TLB for the old region
108 Return success
118-120 This block moves all the PTEs back. A flush_tlb_range() is not necessary as there is no way the region could have been used yet so no TLB entries should exist
121 Zap any pages that were allocated for the move
122 Return failure
D.2.4.7 Function: move_one_page() (mm/mremap.c)
This function is responsible for acquiring the spinlock before finding the correct PTE with get_one_pte() and copying it with copy_one_pte().
77 static int move_one_page(struct mm_struct *mm,
        unsigned long old_addr, unsigned long new_addr)
78 {
79     int error = 0;
80     pte_t * src;

82     spin_lock(&mm->page_table_lock);
83     src = get_one_pte(mm, old_addr);
84     if (src)
85         error = copy_one_pte(mm, src, alloc_one_pte(mm, new_addr));
86     spin_unlock(&mm->page_table_lock);
87     return error;
88 }
82 Acquire the mm lock
83 Call get_one_pte() (See Section D.2.4.8) which walks the page tables to get the correct PTE
84-85 If the PTE exists, allocate a PTE for the destination and copy the PTEs with copy_one_pte() (See Section D.2.4.10)
86 Release the lock
87 Return whatever copy_one_pte() returned. It will only return an error if alloc_one_pte() (See Section D.2.4.9) failed on line 85

D.2.4.8 Function: get_one_pte() (mm/mremap.c)
This is a very simple page table walk.
18 static inline pte_t *get_one_pte(struct mm_struct *mm,
        unsigned long addr)
19 {
20     pgd_t * pgd;
21     pmd_t * pmd;
22     pte_t * pte = NULL;

24     pgd = pgd_offset(mm, addr);
25     if (pgd_none(*pgd))
26         goto end;
27     if (pgd_bad(*pgd)) {
28         pgd_ERROR(*pgd);
29         pgd_clear(pgd);
30         goto end;
31     }

33     pmd = pmd_offset(pgd, addr);
34     if (pmd_none(*pmd))
35         goto end;
36     if (pmd_bad(*pmd)) {
37         pmd_ERROR(*pmd);
38         pmd_clear(pmd);
39         goto end;
40     }

42     pte = pte_offset(pmd, addr);
43     if (pte_none(*pte))
44         pte = NULL;
45 end:
46     return pte;
47 }
24 Get the PGD for this address
25-26 If no PGD exists, return NULL as no PTE will exist either
27-31
If the PGD is bad, mark that an error occurred in the region, clear its
contents and return NULL
33-40 Acquire the correct PMD in the same fashion as for the PGD
42 Acquire the PTE so it may be returned if it exists
D.2.4.9 Function: alloc_one_pte() (mm/mremap.c)
Trivial function to allocate what is necessary for one PTE in a region.
49 static inline pte_t *alloc_one_pte(struct mm_struct *mm,
        unsigned long addr)
50 {
51     pmd_t * pmd;
52     pte_t * pte = NULL;

54     pmd = pmd_alloc(mm, pgd_offset(mm, addr), addr);
55     if (pmd)
56         pte = pte_alloc(mm, pmd, addr);
57     return pte;
58 }
54 If a PMD entry does not exist, allocate it
55-56 If the PMD exists, allocate a PTE entry. The check to make sure it succeeded is performed later in the function copy_one_pte()

D.2.4.10 Function: copy_one_pte() (mm/mremap.c)
Copies the contents of one PTE to another.
60 static inline int copy_one_pte(struct mm_struct *mm,
        pte_t * src, pte_t * dst)
61 {
62     int error = 0;
63     pte_t pte;

65     if (!pte_none(*src)) {
66         pte = ptep_get_and_clear(src);
67         if (!dst) {
68             /* No dest? We must put it back. */
69             dst = src;
70             error++;
71         }
72         set_pte(dst, pte);
73     }
74     return error;
75 }
65 If the source PTE does not exist, just return 0 to say the copy was successful
66 Get the PTE and remove it from its old location
67-71 If the dst does not exist, it means the call to alloc_one_pte() failed and the copy operation has failed and must be aborted
72 Move the PTE to its new location
74 Return an error if one occurred
D.2.5 Deleting a memory region
D.2.5.1 Function: do_munmap() (mm/mmap.c)
The call graph for this function is shown in Figure 4.11. This function is responsible for unmapping a region. If necessary, the unmapping can span multiple VMAs and it can partially unmap one if necessary. Hence the full unmapping operation is divided into two major operations. This function is responsible for finding what VMAs are affected and unmap_fixup() is responsible for fixing up the remaining VMAs.
This function is divided up into a number of small sections which will be dealt with in turn. They are, broadly speaking:
• Function preamble and find the VMA to start working from
• Take all VMAs affected by the unmapping out of the mm and place them on a linked list headed by the variable free
• Cycle through the list headed by free, unmap all the pages in the region to be unmapped and call unmap_fixup() to fix up the mappings
• Validate the mm and free memory associated with the unmapping
924 int do_munmap(struct mm_struct *mm, unsigned long addr,
        size_t len)
925 {
926     struct vm_area_struct *mpnt, *prev, **npp, *free, *extra;

928     if ((addr & ~PAGE_MASK) || addr > TASK_SIZE ||
            len > TASK_SIZE-addr)
929         return -EINVAL;

931     if ((len = PAGE_ALIGN(len)) == 0)
932         return -EINVAL;

939     mpnt = find_vma_prev(mm, addr, &prev);
940     if (!mpnt)
941         return 0;
942     /* we have  addr < mpnt->vm_end  */

944     if (mpnt->vm_start >= addr+len)
945         return 0;

948     if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len)
949         && mm->map_count >= max_map_count)
950         return -ENOMEM;

956     extra = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
957     if (!extra)
958         return -ENOMEM;
924 The parameters are as follows;
mm The mm for the processes performing the unmap operation
addr The starting address of the region to unmap
len The length of the region
928-929 Ensure the address is page aligned and that the area to be unmapped is
not in the kernel virtual address space
931-932 Make sure the region size to unmap is page aligned
939 Find the VMA that contains the starting address and the preceding VMA so
it can be easily unlinked later
940-941 If no mpnt was returned, it means the address must be past the last used
VMA so the address space is unused, just return
944-945 If the returned VMA starts past the region we are trying to unmap, then the region is unused, just return
948-950 The first part of the check sees if the VMA is just being partially unmapped; if it is, another VMA will be created later to deal with a region being broken into, so the map_count has to be checked to make sure it is not too large
956-958 In case a new mapping is required, it is allocated now as later it will be much more difficult to back out in event of an error
960     npp = (prev ? &prev->vm_next : &mm->mmap);
961     free = NULL;
962     spin_lock(&mm->page_table_lock);
963     for ( ; mpnt && mpnt->vm_start < addr+len; mpnt = *npp) {
964         *npp = mpnt->vm_next;
965         mpnt->vm_next = free;
966         free = mpnt;
967         rb_erase(&mpnt->vm_rb, &mm->mm_rb);
968     }
969     mm->mmap_cache = NULL;  /* Kill the cache. */
970     spin_unlock(&mm->page_table_lock);
This section takes all the VMAs affected by the unmapping and places them on a separate linked list headed by a variable called free. This makes the fixup of the regions much easier.
960 npp becomes the next VMA in the list during the for loop following below. To initialise it, it is either the current VMA (mpnt) or else it becomes the first VMA in the list
961 free is the head of a linked list of VMAs that are affected by the unmapping
962 Lock the mm
963
Cycle through the list until the start of the current VMA is past the end of
the region to be unmapped
964 npp becomes the next VMA in the list
965-966 Remove the current VMA from the linear linked list within the mm and
place it on a linked list headed by free. The current mpnt becomes the head
of the free linked list
967 Delete mpnt from the red-black tree
969 Remove the cached result in case the last looked up result is one of the regions
to be unmapped
970 Unlock the mm
971     /* Ok - we have the memory areas we should free on the
972      * 'free' list, so release them, and unmap the page range..
973      * If the one of the segments is only being partially unmapped,
974      * it will put new vm_area_struct(s) into the address space.
975      * In that case we have to be careful with VM_DENYWRITE.
976      */
978     while ((mpnt = free) != NULL) {
979         unsigned long st, end, size;
980         struct file *file = NULL;

982         free = free->vm_next;

984         st = addr < mpnt->vm_start ? mpnt->vm_start : addr;
985         end = addr+len;
986         end = end > mpnt->vm_end ? mpnt->vm_end : end;
987         size = end - st;

989         if (mpnt->vm_flags & VM_DENYWRITE &&
990             (st != mpnt->vm_start || end != mpnt->vm_end) &&
991             (file = mpnt->vm_file) != NULL) {
992             atomic_dec(&file->f_dentry->d_inode->i_writecount);
993         }
994         remove_shared_vm_struct(mpnt);
995         mm->map_count--;

997         zap_page_range(mm, st, size);

999         /*
1000         * Fix the mapping, and free the old area
1001         * if it wasn't reused.
1002         */
1003        extra = unmap_fixup(mm, mpnt, st, size, extra);
1004        if (file)
1005            atomic_inc(&file->f_dentry->d_inode->i_writecount);
978 Keep stepping through the list until no VMAs are left
982
Move free to the next element in the list leaving mpnt as the head about to
be removed
984 st is the start of the region to be unmapped. If the addr is before the start of the VMA, the starting point is mpnt→vm_start, otherwise it is the supplied address
985-986 Calculate the end of the region to map in a similar fashion
987 Calculate the size of the region to be unmapped in this pass
989-993 If the VM_DENYWRITE flag is specified, a hole will be created by this unmapping and a file is mapped, then the i_writecount is decremented. When this field is negative, it counts how many users there are protecting this file from being opened for writing
994 Remove the file mapping. If the file is still partially mapped, it will be acquired again during unmap_fixup() (See Section D.2.5.2)
995 Reduce the map count
997 Remove all pages within this region
1002 Call unmap_fixup() (See Section D.2.5.2) to fix up the regions after this one is deleted
1003-1004 Increment the writecount to the file as the region has been unmapped. If it was just partially unmapped, this call will simply balance out the decrement at line 987
1006    validate_mm(mm);

1008    /* Release the extra vma struct if it wasn't used */
1009    if (extra)
1010        kmem_cache_free(vm_area_cachep, extra);

1012    free_pgtables(mm, prev, addr, addr+len);

1014    return 0;
1015 }
1006 validate_mm() is a debugging function.
If enabled, it will ensure the VMA
tree for this mm is still valid
1009-1010 If extra VMA was not required, delete it
1012 Free all the page tables that were used for the unmapped region
1014 Return success
D.2.5.2 Function: unmap_fixup() (mm/mmap.c)
This function fixes up the regions after a block has been unmapped. It is passed a list of VMAs that are affected by the unmapping, the region and length to be unmapped and a spare VMA that may be required to fix up the region if a hole is created. There are four principal cases it handles: the unmapping of a region, partial unmapping from the start to somewhere in the middle, partial unmapping from somewhere in the middle to the end and the creation of a hole in the middle of the region. Each case will be taken in turn.
787 static struct vm_area_struct * unmap_fixup(struct mm_struct *mm,
788     struct vm_area_struct *area, unsigned long addr, size_t len,
789     struct vm_area_struct *extra)
790 {
791     struct vm_area_struct *mpnt;
792     unsigned long end = addr + len;

794     area->vm_mm->total_vm -= len >> PAGE_SHIFT;
795     if (area->vm_flags & VM_LOCKED)
796         area->vm_mm->locked_vm -= len >> PAGE_SHIFT;
Function preamble.
787 The parameters to the function are;
mm is the mm the unmapped region belongs to
area is the head of the linked list of VMAs aected by the unmapping
addr is the starting address of the unmapping
len is the length of the region to be unmapped
extra is a spare VMA passed in for when a hole in the middle is created
792 Calculate the end address of the region being unmapped
794 Reduce the count of the number of pages used by the process
795-796 If the pages were locked in memory, reduce the locked page count
798     /* Unmapping the whole area. */
799     if (addr == area->vm_start && end == area->vm_end) {
800         if (area->vm_ops && area->vm_ops->close)
801             area->vm_ops->close(area);
802         if (area->vm_file)
803             fput(area->vm_file);
804         kmem_cache_free(vm_area_cachep, area);
805         return extra;
806     }
The first, and easiest, case is where the full region is being unmapped
799 The full region is unmapped if the addr is the start of the VMA and the end is the end of the VMA. This is interesting because if the unmapping is spanning regions, it is possible the end is beyond the end of the VMA but all of this VMA is still being unmapped
800-801 If a close operation is supplied by the VMA, call it
802-803 If a file or device is mapped, call fput() which decrements the usage count and releases it if the count falls to 0
804 Free the memory for the VMA back to the slab allocator
805 Return the extra VMA as it was unused
809     if (end == area->vm_end) {
810         /*
811          * here area isn't visible to the semaphore-less readers
812          * so we don't need to update it under the spinlock.
813          */
814         area->vm_end = addr;
815         lock_vma_mappings(area);
816         spin_lock(&mm->page_table_lock);
Handle the case where the middle of the region to the end is being unmapped
814 Truncate the VMA back to addr. At this point, the pages for the region have already been freed and the page table entries will be freed later so no further work is required
815 If a file/device is being mapped, the lock protecting shared access to it is taken in the function lock_vma_mappings()
816 Lock the mm. Later in the function, the remaining VMA will be reinserted into the mm
817     } else if (addr == area->vm_start) {
818         area->vm_pgoff += (end - area->vm_start) >> PAGE_SHIFT;
819         /* same locking considerations of the above case */
820         area->vm_start = end;
821         lock_vma_mappings(area);
822         spin_lock(&mm->page_table_lock);
Handle the case where the VMA is being unmapped from the start to some part in the middle
818 Increase the offset within the file/device mapped by the number of pages this unmapping represents
820 Move the start of the VMA to the end of the region being unmapped
821-822 Lock the file/device and mm as above
823     } else {
825         /* Add end mapping -- leave beginning for below */
826         mpnt = extra;
827         extra = NULL;
828         mpnt->vm_mm = area->vm_mm;
829         mpnt->vm_start = end;
830         mpnt->vm_end = area->vm_end;
831         mpnt->vm_page_prot = area->vm_page_prot;
832         mpnt->vm_flags = area->vm_flags;
833         mpnt->vm_raend = 0;
834         mpnt->vm_ops = area->vm_ops;
835         mpnt->vm_pgoff = area->vm_pgoff +
836                 ((end - area->vm_start) >> PAGE_SHIFT);
837         mpnt->vm_file = area->vm_file;
838         mpnt->vm_private_data = area->vm_private_data;
839         if (mpnt->vm_file)
840             get_file(mpnt->vm_file);
841         if (mpnt->vm_ops && mpnt->vm_ops->open)
842             mpnt->vm_ops->open(mpnt);
843         area->vm_end = addr;    /* Truncate area */
845         /* Because mpnt->vm_file == area->vm_file this locks
846          * things correctly.
847          */
848         lock_vma_mappings(area);
849         spin_lock(&mm->page_table_lock);
850         __insert_vm_struct(mm, mpnt);
851     }
Handle the case where a hole is being created by a partial unmapping. In this case, the extra VMA is required to create a new mapping from the end of the unmapped region to the end of the old VMA
826-827 Take the extra VMA and make extra NULL so that the calling function will know it is in use and cannot be freed
828-838 Copy in all the VMA information
839 If a file/device is mapped, get a reference to it with get_file()
841-842 If an open function is provided, call it
843 Truncate the VMA so that it ends at the start of the region to be unmapped
848-849 Lock the files and mm as with the two previous cases
850 Insert the extra VMA into the mm
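The hole-in-the-middle case is easy to provoke from userspace. The following hedged sketch (not from the book) unmaps the middle page of a three page anonymous region, leaving two VMAs behind, which is exactly the situation the spare extra VMA above exists to handle; the resulting layout can be confirmed in /proc/self/maps.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *region, cmd[64];

    region = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Punch a one page hole in the middle of the mapping */
    if (munmap(region + page, page) != 0) {
        perror("munmap");
        return EXIT_FAILURE;
    }

    /* Two separate entries now cover the first and last pages */
    snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", getpid());
    return system(cmd);
}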
853     __insert_vm_struct(mm, area);
854     spin_unlock(&mm->page_table_lock);
855     unlock_vma_mappings(area);
856     return extra;
857 }
853 Reinsert the VMA into the mm
854 Unlock the page tables
855 Unlock the spinlock to the shared mapping
856 Return the extra VMA if it was not used and NULL if it was
D.2.6 Deleting all memory regions
D.2.6.1 Function: exit_mmap() (mm/mmap.c)
This function simply steps through all VMAs associated with the supplied mm and unmaps them.
1127 void exit_mmap(struct mm_struct * mm)
1128 {
1129     struct vm_area_struct * mpnt;

1131     release_segments(mm);
1132     spin_lock(&mm->page_table_lock);
1133     mpnt = mm->mmap;
1134     mm->mmap = mm->mmap_cache = NULL;
1135     mm->mm_rb = RB_ROOT;
1136     mm->rss = 0;
1137     spin_unlock(&mm->page_table_lock);
1138     mm->total_vm = 0;
1139     mm->locked_vm = 0;

1141     flush_cache_mm(mm);
1142     while (mpnt) {
1143         struct vm_area_struct * next = mpnt->vm_next;
1144         unsigned long start = mpnt->vm_start;
1145         unsigned long end = mpnt->vm_end;
1146         unsigned long size = end - start;

1148         if (mpnt->vm_ops) {
1149             if (mpnt->vm_ops->close)
1150                 mpnt->vm_ops->close(mpnt);
1151         }
1152         mm->map_count--;
1153         remove_shared_vm_struct(mpnt);
1154         zap_page_range(mm, start, size);
1155         if (mpnt->vm_file)
1156             fput(mpnt->vm_file);
1157         kmem_cache_free(vm_area_cachep, mpnt);
1158         mpnt = next;
1159     }
1160     flush_tlb_mm(mm);

1162     /* This is just debugging */
1163     if (mm->map_count)
1164         BUG();

1166     clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
1167 }
1131 release_segments() will release memory segments associated with the process on its Local Descriptor Table (LDT) if the architecture supports segments and the process was using them. Some applications, notably WINE, use this feature
1132 Lock the mm
1133 mpnt becomes the rst VMA on the list
1134 Clear VMA related information from the mm so it may be unlocked
1137 Unlock the mm
1138-1139 Clear the mm statistics
1141 Flush the CPU for the address range
1142-1159 Step through every VMA that was associated with the mm
1143 Record what the next VMA to clear will be so this one may be deleted
1144-1146 Record the start, end and size of the region to be deleted
1148-1151 If there is a close operation associated with this VMA, call it
1152 Reduce the map count
1153 Remove the le/device mapping from the shared mappings list
1154 Free all pages associated with this region
1155-1156 If a le/device was mapped in this region, free it
1157 Free the VMA struct
1158 Move to the next VMA
1160 Flush the TLB for this whole mm as it is about to be unmapped
1163-1164 If the map_count is positive, it means the map count was not accounted for properly so call BUG() to mark it
1166 Clear the page tables associated with this region with clear_page_tables()
(See Section D.2.6.2)
D.2.6.2 Function: clear_page_tables() (mm/memory.c)
This is the top-level function used to unmap all PTEs and free pages within a region. It is used when pagetables need to be torn down such as when the process exits or a region is unmapped.
146 void clear_page_tables(struct mm_struct *mm,
        unsigned long first, int nr)
147 {
148     pgd_t * page_dir = mm->pgd;

150     spin_lock(&mm->page_table_lock);
151     page_dir += first;
152     do {
153         free_one_pgd(page_dir);
154         page_dir++;
155     } while (--nr);
156     spin_unlock(&mm->page_table_lock);

158     /* keep the page table cache within bounds */
159     check_pgt_cache();
160 }
148 Get the PGD for the mm being unmapped
150 Lock the pagetables
151-155 Step through all PGDs in the requested range. For each PGD found, call free_one_pgd() (See Section D.2.6.3)
156 Unlock the pagetables
159 Check the cache of available PGD structures.
If there are too many PGDs in
the PGD quicklist, some of them will be reclaimed
D.2.6.3 Function: free_one_pgd() (mm/memory.c)
This function tears down one PGD. For each PMD in this PGD, free_one_pmd() will be called.

109 static inline void free_one_pgd(pgd_t * dir)
110 {
111     int j;
112     pmd_t * pmd;

114     if (pgd_none(*dir))
115         return;
116     if (pgd_bad(*dir)) {
117         pgd_ERROR(*dir);
118         pgd_clear(dir);
119         return;
120     }
121     pmd = pmd_offset(dir, 0);
122     pgd_clear(dir);
123     for (j = 0; j < PTRS_PER_PMD ; j++) {
124         prefetchw(pmd+j+(PREFETCH_STRIDE/16));
125         free_one_pmd(pmd+j);
126     }
127     pmd_free(pmd);
128 }
114-115 If no PGD exists here, return

116-120 If the PGD is bad, flag the error and return

121 Get the first PMD in the PGD

122 Clear the PGD entry

123-126 For each PMD in this PGD, call free_one_pmd() (See Section D.2.6.4)

127 Free the PMD page to the PMD quicklist. Later, check_pgt_cache() will be called and if the cache has too many PMD pages in it, they will be reclaimed
D.2.6.4  Function: free_one_pmd()  (mm/memory.c)

 93 static inline void free_one_pmd(pmd_t * dir)
 94 {
 95     pte_t * pte;
 96
 97     if (pmd_none(*dir))
 98         return;
 99     if (pmd_bad(*dir)) {
100         pmd_ERROR(*dir);
101         pmd_clear(dir);
102         return;
103     }
104     pte = pte_offset(dir, 0);
105     pmd_clear(dir);
106     pte_free(pte);
107 }
97-98 If no PMD exists here, return

99-103 If the PMD is bad, flag the error and return

104 Get the first PTE in the PMD

105 Clear the PMD from the pagetable

106 Free the PTE page to the PTE quicklist cache with pte_free(). Later, check_pgt_cache() will be called and if the cache has too many PTE pages in it, they will be reclaimed
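The three functions above walk the pagetables strictly top-down: each PGD entry is cleared before its PMDs are visited and each PMD is cleared before its PTE page is returned to the quicklist. The following standalone sketch is not kernel code; it models the same top-down teardown with a toy two-level table built from plain arrays so the ordering is easy to trace. All names (toy_pgd, toy_pmd and so on) are invented for illustration.

#include <stdio.h>
#include <stdlib.h>

#define TOY_PTRS_PER_PGD 4
#define TOY_PTRS_PER_PMD 4

/* A toy PMD is an array of PTE "pages"; a toy PGD points to PMDs */
struct toy_pmd { void *pte_page[TOY_PTRS_PER_PMD]; };
struct toy_pgd { struct toy_pmd *pmd[TOY_PTRS_PER_PGD]; };

/* Mirrors free_one_pmd(): clear the entry, then free the PTE page */
static void toy_free_one_pmd(struct toy_pmd *pmd, int i)
{
        void *pte = pmd->pte_page[i];
        if (!pte)
                return;                 /* pmd_none() case */
        pmd->pte_page[i] = NULL;        /* pmd_clear() */
        free(pte);                      /* pte_free() */
}

/* Mirrors free_one_pgd(): clear the entry, then walk every PMD under it */
static void toy_free_one_pgd(struct toy_pgd *pgd, int i)
{
        struct toy_pmd *pmd = pgd->pmd[i];
        int j;

        if (!pmd)
                return;                 /* pgd_none() case */
        pgd->pmd[i] = NULL;             /* pgd_clear() */
        for (j = 0; j < TOY_PTRS_PER_PMD; j++)
                toy_free_one_pmd(pmd, j);
        free(pmd);                      /* pmd_free() */
}

/* Mirrors clear_page_tables(): walk a range of PGD entries */
static void toy_clear_page_tables(struct toy_pgd *pgd, int first, int nr)
{
        do {
                toy_free_one_pgd(pgd, first++);
        } while (--nr);
}

int main(void)
{
        struct toy_pgd pgd = { { NULL } };

        pgd.pmd[1] = calloc(1, sizeof(struct toy_pmd));
        pgd.pmd[1]->pte_page[2] = malloc(64);   /* one mapped "PTE page" */

        toy_clear_page_tables(&pgd, 0, TOY_PTRS_PER_PGD);
        printf("entry 1 after teardown: %p\n", (void *)pgd.pmd[1]);
        return 0;
}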
D.3  Searching Memory Regions

Contents
D.3 Searching Memory Regions                              293
D.3.1 Finding a Mapped Memory Region                      293
D.3.1.1 Function: find_vma()                              293
D.3.1.2 Function: find_vma_prev()                         294
D.3.1.3 Function: find_vma_intersection()                 296
D.3.2 Finding a Free Memory Region                        296
D.3.2.1 Function: get_unmapped_area()                     296
D.3.2.2 Function: arch_get_unmapped_area()                297

The functions in this section deal with searching the virtual address space for mapped and free regions.
D.3.1  Finding a Mapped Memory Region

D.3.1.1  Function: find_vma()  (mm/mmap.c)

661 struct vm_area_struct * find_vma(struct mm_struct * mm,
                                     unsigned long addr)
662 {
663     struct vm_area_struct *vma = NULL;
664
665     if (mm) {
666         /* Check the cache first. */
667         /* (Cache hit rate is typically around 35%.) */
668         vma = mm->mmap_cache;
669         if (!(vma && vma->vm_end > addr &&
                    vma->vm_start <= addr)) {
670             rb_node_t * rb_node;
671
672             rb_node = mm->mm_rb.rb_node;
673             vma = NULL;
674
675             while (rb_node) {
676                 struct vm_area_struct * vma_tmp;
677
678                 vma_tmp = rb_entry(rb_node,
                            struct vm_area_struct, vm_rb);
679
680                 if (vma_tmp->vm_end > addr) {
681                     vma = vma_tmp;
682                     if (vma_tmp->vm_start <= addr)
683                         break;
684                     rb_node = rb_node->rb_left;
685                 } else
686                     rb_node = rb_node->rb_right;
687             }
688             if (vma)
689                 mm->mmap_cache = vma;
690         }
691     }
692     return vma;
693 }
661 The two parameters are the top level mm_struct that is to be searched and the address the caller is interested in

663 Default to returning NULL for address not found

665 Make sure the caller does not try and search a bogus mm

668 mmap_cache has the result of the last call to find_vma(). This gives a chance of not having to search the red-black tree at all

669 If the cached VMA is valid, check to see if the address being searched is contained within it. If it is, the VMA was the mmap_cache one so it can be returned, otherwise the tree is searched

670-674 Start at the root of the tree

675-687 This block is the tree walk

678 The macro, as the name suggests, returns the VMA this tree node points to

680 Check whether the next node to traverse is the left or the right leaf

682 If the current VMA is what is required, exit the while loop

689 If the VMA is valid, set the mmap_cache for the next call to find_vma()

692 Return the VMA that contains the address or, as a side effect of the tree walk, return the VMA that is closest to the requested address
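The return convention is easy to get wrong: find_vma() returns the first VMA that ends above the address, which is not necessarily one that contains it. The standalone sketch below is not kernel code; it reproduces that convention over a plain sorted array of regions (the names region and find_region are invented) so the "closest VMA" behaviour and the check every caller must make can be run and inspected.

#include <stdio.h>

struct region { unsigned long start, end; };    /* [start, end) */

/* Return the first region whose end is above addr, mimicking the
 * convention of find_vma(): the result may lie entirely above addr */
static struct region *find_region(struct region *r, int nr, unsigned long addr)
{
        int i;
        for (i = 0; i < nr; i++)
                if (r[i].end > addr)
                        return &r[i];
        return NULL;
}

int main(void)
{
        struct region map[] = { { 0x1000, 0x3000 }, { 0x8000, 0x9000 } };
        unsigned long addr = 0x5000;            /* falls in the hole */
        struct region *r = find_region(map, 2, addr);

        /* The caller must still check the start, exactly as the kernel does */
        if (!r || addr < r->start)
                printf("0x%lx is unmapped (closest region starts at 0x%lx)\n",
                       addr, r ? r->start : 0);
        else
                printf("0x%lx is inside [0x%lx, 0x%lx)\n",
                       addr, r->start, r->end);
        return 0;
}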
D.3.1.2  Function: find_vma_prev()  (mm/mmap.c)

696 struct vm_area_struct * find_vma_prev(struct mm_struct * mm,
                                          unsigned long addr,
697                                       struct vm_area_struct **pprev)
698 {
699     if (mm) {
700         /* Go through the RB tree quickly. */
701         struct vm_area_struct * vma;
702         rb_node_t * rb_node, * rb_last_right, * rb_prev;
703
704         rb_node = mm->mm_rb.rb_node;
705         rb_last_right = rb_prev = NULL;
706         vma = NULL;
707
708         while (rb_node) {
709             struct vm_area_struct * vma_tmp;
710
711             vma_tmp = rb_entry(rb_node,
                        struct vm_area_struct, vm_rb);
712
713             if (vma_tmp->vm_end > addr) {
714                 vma = vma_tmp;
715                 rb_prev = rb_last_right;
716                 if (vma_tmp->vm_start <= addr)
717                     break;
718                 rb_node = rb_node->rb_left;
719             } else {
720                 rb_last_right = rb_node;
721                 rb_node = rb_node->rb_right;
722             }
723         }
724         if (vma) {
725             if (vma->vm_rb.rb_left) {
726                 rb_prev = vma->vm_rb.rb_left;
727                 while (rb_prev->rb_right)
728                     rb_prev = rb_prev->rb_right;
729             }
730             *pprev = NULL;
731             if (rb_prev)
732                 *pprev = rb_entry(rb_prev, struct
                            vm_area_struct, vm_rb);
733             if ((rb_prev ? (*pprev)->vm_next : mm->mmap) != vma)
734                 BUG();
735             return vma;
736         }
737     }
738     *pprev = NULL;
739     return NULL;
740 }
696-723 This is essentially the same as the find_vma() function already described. The only difference is that the last right node accessed is remembered, as this will represent the VMA previous to the requested VMA

725-729 If the returned VMA has a left node, it means that it has to be traversed. It first takes the left leaf and then follows each right leaf until the bottom of the tree is found

731-732 Extract the VMA from the red-black tree node

733-734 A debugging check. If this is the previous node, then its next field should point to the VMA being returned. If it is not, it is a bug
D.3.1.3  Function: find_vma_intersection()  (include/linux/mm.h)

673 static inline struct vm_area_struct * find_vma_intersection(
                        struct mm_struct * mm,
                        unsigned long start_addr, unsigned long end_addr)
674 {
675     struct vm_area_struct * vma = find_vma(mm,start_addr);
676
677     if (vma && end_addr <= vma->vm_start)
678         vma = NULL;
679     return vma;
680 }
675 Return the VMA closest to the starting address

677 If a VMA is returned and the end address is still less than the beginning of the returned VMA, the VMA does not intersect

679 Return the VMA if it does intersect
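Because find_vma() already guarantees that any VMA returned ends above start_addr, only the end_addr <= vma->vm_start half of the overlap test needs to be written on line 677. The standalone sketch below (not kernel code, names invented) spells out the full half-open interval overlap test that the combination implements.

#include <stdio.h>

/* Half-open intervals [s1, e1) and [s2, e2) overlap iff s1 < e2 && s2 < e1 */
static int ranges_intersect(unsigned long s1, unsigned long e1,
                            unsigned long s2, unsigned long e2)
{
        return s1 < e2 && s2 < e1;
}

int main(void)
{
        /* A "VMA" at [0x8000, 0x9000) versus a request for [0x5000, 0x8000) */
        printf("%d\n", ranges_intersect(0x8000, 0x9000, 0x5000, 0x8000)); /* 0 */
        /* The same VMA versus [0x5000, 0x8001): one byte of overlap */
        printf("%d\n", ranges_intersect(0x8000, 0x9000, 0x5000, 0x8001)); /* 1 */
        return 0;
}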
D.3.2  Finding a Free Memory Region

D.3.2.1  Function: get_unmapped_area()  (mm/mmap.c)
The call graph for this function is shown in Figure 4.5.

644 unsigned long get_unmapped_area(struct file *file,
                                    unsigned long addr,
                                    unsigned long len,
                                    unsigned long pgoff,
                                    unsigned long flags)
645 {
646     if (flags & MAP_FIXED) {
647         if (addr > TASK_SIZE - len)
648             return -ENOMEM;
649         if (addr & ~PAGE_MASK)
650             return -EINVAL;
651         return addr;
652     }
653
654     if (file && file->f_op && file->f_op->get_unmapped_area)
655         return file->f_op->get_unmapped_area(file, addr,
                        len, pgoff, flags);
656
657     return arch_get_unmapped_area(file, addr, len, pgoff, flags);
658 }
644 The parameters passed are:

    file The file or device being mapped
    addr The requested address to map to
    len The length of the mapping
    pgoff The offset within the file being mapped
    flags Protection flags

646-652 Sanity checks. If it is required that the mapping be placed at the specified address, make sure it will not overflow the address space and that it is page aligned

654 If the struct file provides a get_unmapped_area() function, use it

657 Else use arch_get_unmapped_area() (See Section D.3.2.2) as an anonymous version of the get_unmapped_area() function
D.3.2.2  Function: arch_get_unmapped_area()  (mm/mmap.c)
Architectures have the option of specifying this function for themselves by defining HAVE_ARCH_UNMAPPED_AREA. If the architecture does not supply one, this version is used.

614 #ifndef HAVE_ARCH_UNMAPPED_AREA
615 static inline unsigned long arch_get_unmapped_area(
                struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags)
616 {
617     struct vm_area_struct *vma;
618
619     if (len > TASK_SIZE)
620         return -ENOMEM;
621
622     if (addr) {
623         addr = PAGE_ALIGN(addr);
624         vma = find_vma(current->mm, addr);
625         if (TASK_SIZE - len >= addr &&
626             (!vma || addr + len <= vma->vm_start))
627             return addr;
628     }
629     addr = PAGE_ALIGN(TASK_UNMAPPED_BASE);
630
631     for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
632         /* At this point: (!vma || addr < vma->vm_end). */
633         if (TASK_SIZE - len < addr)
634             return -ENOMEM;
635         if (!vma || addr + len <= vma->vm_start)
636             return addr;
637         addr = vma->vm_end;
638     }
639 }
640 #else
641 extern unsigned long arch_get_unmapped_area(struct file *,
                unsigned long, unsigned long,
                unsigned long, unsigned long);
642 #endif
614 If this is not defined, it means that the architecture does not provide its own arch_get_unmapped_area() so this one is used instead

615 The parameters are the same as those for get_unmapped_area() (See Section D.3.2.1)

619-620 Sanity check, make sure the required map length is not too long

622-628 If an address is provided, use it for the mapping

623 Make sure the address is page aligned

624 find_vma() (See Section D.3.1.1) will return the region closest to the requested address

625-627 Make sure the mapping will not overlap with another region. If it does not, return it as it is safe to use. Otherwise it gets ignored

629 TASK_UNMAPPED_BASE is the starting point for searching for a free region to use

631-638 Starting from TASK_UNMAPPED_BASE, linearly search the VMAs until a large enough region between them is found to store the new mapping. This is essentially a first fit search

641 If an external function is provided, it still needs to be declared here
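The linear scan on lines 631-638 is a textbook first-fit search over an address-ordered list of regions. The standalone sketch below is not the kernel implementation; it runs the same gap-finding logic over a toy array of busy regions (the names region, first_fit and TOY_TASK_SIZE are invented) so it can be tried in isolation.

#include <stdio.h>

#define TOY_TASK_SIZE 0x100000UL

struct region { unsigned long start, end; };    /* busy areas, sorted by start */

/* First fit: return the lowest address >= base where len bytes are free */
static unsigned long first_fit(const struct region *r, int nr,
                               unsigned long base, unsigned long len)
{
        unsigned long addr = base;
        int i;

        for (i = 0; i <= nr; i++) {
                if (TOY_TASK_SIZE - len < addr)
                        return 0;               /* address space exhausted */
                /* Past the last region, or the gap before r[i] is big enough */
                if (i == nr || addr + len <= r[i].start)
                        return addr;
                addr = r[i].end;                /* skip to the end of this region */
        }
        return 0;
}

int main(void)
{
        struct region busy[] = { { 0x40000, 0x48000 }, { 0x48000, 0x50000 } };

        /* Ask for 0x10000 bytes starting the search at 0x40000 */
        printf("found gap at 0x%lx\n", first_fit(busy, 2, 0x40000, 0x10000));
        return 0;
}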
D.4  Locking and Unlocking Memory Regions

Contents
D.4 Locking and Unlocking Memory Regions                  299
D.4.1 Locking a Memory Region                             299
D.4.1.1 Function: sys_mlock()                             299
D.4.1.2 Function: sys_mlockall()                          300
D.4.1.3 Function: do_mlockall()                           302
D.4.1.4 Function: do_mlock()                              303
D.4.2 Unlocking the region                                305
D.4.2.1 Function: sys_munlock()                           305
D.4.2.2 Function: sys_munlockall()                        306
D.4.3 Fixing up regions after locking/unlocking           306
D.4.3.1 Function: mlock_fixup()                           306
D.4.3.2 Function: mlock_fixup_all()                       308
D.4.3.3 Function: mlock_fixup_start()                     308
D.4.3.4 Function: mlock_fixup_end()                       309
D.4.3.5 Function: mlock_fixup_middle()                    310

This section contains the functions related to locking and unlocking a region. The main complexity in them is how the regions need to be fixed up after the operation takes place.
D.4.1  Locking a Memory Region

D.4.1.1  Function: sys_mlock()  (mm/mlock.c)
The call graph for this function is shown in Figure 4.10. This is the system call mlock() for locking a region of memory into physical memory. This function simply checks to make sure that process and user limits are not exceeded and that the region to lock is page aligned.

195 asmlinkage long sys_mlock(unsigned long start, size_t len)
196 {
197     unsigned long locked;
198     unsigned long lock_limit;
199     int error = -ENOMEM;
200
201     down_write(&current->mm->mmap_sem);
202     len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
203     start &= PAGE_MASK;
204
205     locked = len >> PAGE_SHIFT;
206     locked += current->mm->locked_vm;
207
208     lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur;
209     lock_limit >>= PAGE_SHIFT;
210
211     /* check against resource limits */
212     if (locked > lock_limit)
213         goto out;
214
215     /* we may lock at most half of physical memory... */
216     /* (this check is pretty bogus, but doesn't hurt) */
217     if (locked > num_physpages/2)
218         goto out;
219
220     error = do_mlock(start, len, 1);
221 out:
222     up_write(&current->mm->mmap_sem);
223     return error;
224 }
201 Take the semaphore; we are likely to sleep during this so a spinlock can not be used

202 Round the length up to the page boundary

203 Round the start address down to the page boundary

205 Calculate how many pages will be locked

206 Calculate how many pages will be locked in total by this process

208-209 Calculate what the limit is to the number of locked pages

212-213 Do not allow the process to lock more than it should

217-218 Do not allow the process to map more than half of physical memory

220 Call do_mlock() (See Section D.4.1.4) which starts the real work by finding the VMA closest to the area to lock before calling mlock_fixup() (See Section D.4.3.1)

222 Free the semaphore

223 Return the error or success code from do_mlock()
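From userspace these limits surface as error returns from the mlock() system call; on the 2.4 kernels described here the caller must also hold CAP_IPC_LOCK or the request fails with EPERM. The hedged example below locks one page of an anonymous mapping and releases it again; error handling is trimmed to the essentials.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        /* An anonymous, page-aligned buffer to lock */
        char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* sys_mlock() will page-align start/len itself, but being explicit
         * avoids accidentally locking a neighbouring page */
        if (mlock(buf, page) != 0) {
                perror("mlock");        /* EPERM or ENOMEM hit a limit above */
                return 1;
        }
        memset(buf, 0, page);           /* this page can no longer be swapped */

        munlock(buf, page);
        munmap(buf, page);
        return 0;
}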
D.4.1.2  Function: sys_mlockall()  (mm/mlock.c)
This is the system call mlockall() which attempts to lock all pages in the calling process in memory. If MCL_CURRENT is specified, all current pages will be locked. If MCL_FUTURE is specified, all future mappings will be locked. The flags may be or-ed together. This function makes sure that the flags and process limits are ok before calling do_mlockall().

266 asmlinkage long sys_mlockall(int flags)
267 {
268     unsigned long lock_limit;
269     int ret = -EINVAL;
270
271     down_write(&current->mm->mmap_sem);
272     if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
273         goto out;
274
275     lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur;
276     lock_limit >>= PAGE_SHIFT;
277
278     ret = -ENOMEM;
279     if (current->mm->total_vm > lock_limit)
280         goto out;
281
282     /* we may lock at most half of physical memory... */
283     /* (this check is pretty bogus, but doesn't hurt) */
284     if (current->mm->total_vm > num_physpages/2)
285         goto out;
286
287     ret = do_mlockall(flags);
288 out:
289     up_write(&current->mm->mmap_sem);
290     return ret;
291 }
269 By default, return -EINVAL to indicate invalid parameters

271 Acquire the current mm_struct semaphore

272-273 Make sure that some valid flag has been specified. If not, goto out to unlock the semaphore and return -EINVAL

275-276 Check the process limits to see how many pages may be locked

278 From here on, the default error is -ENOMEM

279-280 If the size of the locking would exceed set limits, then goto out

284-285 Do not allow this process to lock more than half of physical memory. This is a bogus check because four processes locking a quarter of physical memory each will bypass this. It is acceptable though, as only root processes are allowed to lock memory and they are unlikely to make this type of mistake

287 Call the core function do_mlockall() (See Section D.4.1.3)

289-290 Unlock the semaphore and return
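As with mlock(), the checks above are visible in userspace as error returns from mlockall(). The hedged example below shows the common pattern of a latency-sensitive process locking everything it currently has mapped plus anything it maps later; on the 2.4 kernels described here it must run with CAP_IPC_LOCK.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* Lock all pages mapped now and all pages mapped in the future */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");     /* EPERM without CAP_IPC_LOCK,
                                           ENOMEM if over the lock limits */
                return 1;
        }

        /* ... latency-critical work that must never wait on the swap ... */

        munlockall();                   /* clears VM_LOCKED from every VMA */
        return 0;
}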
D.4.1.3  Function: do_mlockall()  (mm/mlock.c)

238 static int do_mlockall(int flags)
239 {
240     int error;
241     unsigned int def_flags;
242     struct vm_area_struct * vma;
243
244     if (!capable(CAP_IPC_LOCK))
245         return -EPERM;
246
247     def_flags = 0;
248     if (flags & MCL_FUTURE)
249         def_flags = VM_LOCKED;
250     current->mm->def_flags = def_flags;
251
252     error = 0;
253     for (vma = current->mm->mmap; vma ; vma = vma->vm_next) {
254         unsigned int newflags;
255
256         newflags = vma->vm_flags | VM_LOCKED;
257         if (!(flags & MCL_CURRENT))
258             newflags &= ~VM_LOCKED;
259         error = mlock_fixup(vma, vma->vm_start, vma->vm_end,
                                newflags);
260         if (error)
261             break;
262     }
263     return error;
264 }
244-245 The calling process must be either root or have CAP_IPC_LOCK capabilities

248-250 The MCL_FUTURE flag says that all future pages should be locked so, if set, the def_flags for VMAs should be VM_LOCKED

253-262 Cycle through all VMAs

256 Set the VM_LOCKED flag in the current VMA flags

257-258 If the MCL_CURRENT flag has not been set, requesting that all current pages be locked, then clear the VM_LOCKED flag. The logic is arranged like this so that the unlock code can use this same function, just with no flags

259 Call mlock_fixup() (See Section D.4.3.1) which will adjust the regions to match the locking as necessary

260-261 If a non-zero value is returned at any point, stop locking. It is interesting to note that VMAs already locked will not be unlocked

263 Return the success or error value
D.4.1.4  Function: do_mlock()  (mm/mlock.c)
This function is responsible for starting the work needed to either lock or unlock a region, depending on the value of the on parameter. It is broken up into two sections. The first makes sure the region is page aligned (despite the fact the only two callers of this function do the same thing) before finding the VMA that is to be adjusted. The second part then sets the appropriate flags before calling mlock_fixup() for each VMA that is affected by this locking.

148 static int do_mlock(unsigned long start, size_t len, int on)
149 {
150     unsigned long nstart, end, tmp;
151     struct vm_area_struct * vma, * next;
152     int error;
153
154     if (on && !capable(CAP_IPC_LOCK))
155         return -EPERM;
156     len = PAGE_ALIGN(len);
157     end = start + len;
158     if (end < start)
159         return -EINVAL;
160     if (end == start)
161         return 0;
162     vma = find_vma(current->mm, start);
163     if (!vma || vma->vm_start > start)
164         return -ENOMEM;
Page align the request and find the VMA that is affected.

154 Only root processes can lock pages

156 Page align the length. This is redundant as the length is page aligned in the parent functions

157-159 Calculate the end of the locking and make sure it is a valid region. Return -EINVAL if it is not

160-161 If locking a region of size 0, just return

162 Find the VMA that will be affected by this locking

163-164 If the VMA for this address range does not exist, return -ENOMEM
166     for (nstart = start ; ; ) {
167         unsigned int newflags;
168
170         newflags = vma->vm_flags | VM_LOCKED;
171         if (!on)
172             newflags &= ~VM_LOCKED;
173
174         if (vma->vm_end >= end) {
175             error = mlock_fixup(vma, nstart, end, newflags);
176             break;
177         }
178
179         tmp = vma->vm_end;
180         next = vma->vm_next;
181         error = mlock_fixup(vma, nstart, tmp, newflags);
182         if (error)
183             break;
184         nstart = tmp;
185         vma = next;
186         if (!vma || vma->vm_start != nstart) {
187             error = -ENOMEM;
188             break;
189         }
190     }
191
192     return error;
193 }
Walk through the VMAs affected by this locking and call mlock_fixup() for each of them.

166-192 Cycle through as many VMAs as necessary to lock the pages

171 Set the VM_LOCKED flag on the VMA

172-173 Unless this is an unlock, in which case remove the flag

175-177 If this VMA is the last VMA to be affected by the locking, call mlock_fixup() with the end address for the locking and exit

180-190 Else this whole VMA needs to be locked. To lock it, the end of this VMA is passed as a parameter to mlock_fixup() (See Section D.4.3.1) instead of the end of the actual locking

180 tmp is the end of the mapping on this VMA

181 next is the next VMA that will be affected by the locking

182 Call mlock_fixup() (See Section D.4.3.1) for this VMA

183-184 If an error occurs, back out. Note that the VMAs already locked are not fixed up right

185 The next start address is the start of the next VMA

186 Move to the next VMA

187-190 If there is no VMA, return -ENOMEM. The second part of the condition would require the regions to be extremely broken as a result of a broken implementation of mlock_fixup() or have VMAs that overlap

192 Return the error or success value
D.4.2  Unlocking the region

D.4.2.1  Function: sys_munlock()  (mm/mlock.c)
Page align the request before calling do_mlock() which begins the real work of fixing up the regions.

226 asmlinkage long sys_munlock(unsigned long start, size_t len)
227 {
228     int ret;
229
230     down_write(&current->mm->mmap_sem);
231     len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
232     start &= PAGE_MASK;
233     ret = do_mlock(start, len, 0);
234     up_write(&current->mm->mmap_sem);
235     return ret;
236 }
230 Acquire the semaphore protecting the mm_struct
231 Round the length of the region up to the nearest page boundary
232 Round the start of the region down to the nearest page boundary
233 Call do_mlock() (See Section D.4.1.4) with 0 as the third parameter to unlock the region
234 Release the semaphore
235 Return the success or failure code
D.4.2.2  Function: sys_munlockall()  (mm/mlock.c)
Trivial function. If the flags to mlockall() are 0, it gets translated as none of the current pages must be present and no future mappings should be locked either, which means the VM_LOCKED flag will be removed from all VMAs.
293 asmlinkage long sys_munlockall(void)
294 {
295     int ret;
296
297     down_write(&current->mm->mmap_sem);
298     ret = do_mlockall(0);
299     up_write(&current->mm->mmap_sem);
300     return ret;
301 }
297 Acquire the semaphore protecting the mm_struct

298 Call do_mlockall() (See Section D.4.1.3) with 0 as flags, which will remove the VM_LOCKED flag from all VMAs

299 Release the semaphore

300 Return the error or success code
D.4.3  Fixing up regions after locking/unlocking

D.4.3.1  Function: mlock_fixup()  (mm/mlock.c)
This function identifies four separate types of locking that must be addressed. The first is where the full VMA is to be locked, in which case it calls mlock_fixup_all(). The second is where only the beginning portion of the VMA is affected, handled by mlock_fixup_start(). The third is the locking of a region at the end, handled by mlock_fixup_end(), and the last is locking a region in the middle of the VMA with mlock_fixup_middle().

117 static int mlock_fixup(struct vm_area_struct * vma,
118     unsigned long start, unsigned long end, unsigned int newflags)
119 {
120     int pages, retval;
121
122     if (newflags == vma->vm_flags)
123         return 0;
124
125     if (start == vma->vm_start) {
126         if (end == vma->vm_end)
127             retval = mlock_fixup_all(vma, newflags);
128         else
129             retval = mlock_fixup_start(vma, end, newflags);
130     } else {
131         if (end == vma->vm_end)
132             retval = mlock_fixup_end(vma, start, newflags);
133         else
134             retval = mlock_fixup_middle(vma, start,
                            end, newflags);
135     }
136     if (!retval) {
137         /* keep track of amount of locked VM */
138         pages = (end - start) >> PAGE_SHIFT;
139         if (newflags & VM_LOCKED) {
140             pages = -pages;
141             make_pages_present(start, end);
142         }
143         vma->vm_mm->locked_vm -= pages;
144     }
145     return retval;
146 }
122-123 If no change is to be made, just return

125 If the start of the locking is at the start of the VMA, it means that either the full region is to be locked or only a portion at the beginning

126-127 The full VMA is being locked, call mlock_fixup_all() (See Section D.4.3.2)

128-129 Part of the VMA is being locked with the start of the VMA matching the start of the locking, call mlock_fixup_start() (See Section D.4.3.3)

130 Else either a region at the end is to be locked or a region in the middle

131-132 The end of the locking matches the end of the VMA, call mlock_fixup_end() (See Section D.4.3.4)

133-134 A region in the middle of the VMA is to be locked, call mlock_fixup_middle() (See Section D.4.3.5)

136-144 The fixup functions return 0 on success. If the fixup of the regions succeeded and the regions are now marked as locked, call make_pages_present() which makes some basic checks before calling get_user_pages(), which faults in all the pages in the same way the page fault handler does
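The dispatch on lines 125-135 is a pure interval comparison: does the locked range share the VMA's start, its end, both or neither? The hedged sketch below is not kernel code; it reruns that decision over plain integers (the enum and function names are invented) so each of the four cases can be exercised directly.

#include <stdio.h>

enum fixup_case { FIXUP_ALL, FIXUP_START, FIXUP_END, FIXUP_MIDDLE };

/* Same comparisons as mlock_fixup(): which part of [vm_start, vm_end)
 * does the range [start, end) cover? */
static enum fixup_case classify(unsigned long vm_start, unsigned long vm_end,
                                unsigned long start, unsigned long end)
{
        if (start == vm_start)
                return end == vm_end ? FIXUP_ALL : FIXUP_START;
        return end == vm_end ? FIXUP_END : FIXUP_MIDDLE;
}

int main(void)
{
        static const char *name[] = { "all", "start", "end", "middle" };

        /* A VMA spanning [0x1000, 0x9000) and four different lock requests */
        printf("%s\n", name[classify(0x1000, 0x9000, 0x1000, 0x9000)]);
        printf("%s\n", name[classify(0x1000, 0x9000, 0x1000, 0x4000)]);
        printf("%s\n", name[classify(0x1000, 0x9000, 0x4000, 0x9000)]);
        printf("%s\n", name[classify(0x1000, 0x9000, 0x4000, 0x5000)]);
        return 0;
}

The middle case is the expensive one because two new VMAs must be allocated, which is why mlock_fixup_middle() below needs two kmem_cache_alloc() calls.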
D.4.3.2  Function: mlock_fixup_all()  (mm/mlock.c)

15 static inline int mlock_fixup_all(struct vm_area_struct * vma,
                                     int newflags)
16 {
17     spin_lock(&vma->vm_mm->page_table_lock);
18     vma->vm_flags = newflags;
19     spin_unlock(&vma->vm_mm->page_table_lock);
20     return 0;
21 }

17-19 Trivial. Lock the VMA with the spinlock, set the new flags, release the lock and return success
D.4.3.3  Function: mlock_fixup_start()  (mm/mlock.c)
Slightly more complicated. A new VMA is required to represent the affected region. The start of the old VMA is moved forward.

23 static inline int mlock_fixup_start(struct vm_area_struct * vma,
24     unsigned long end, int newflags)
25 {
26     struct vm_area_struct * n;
27
28     n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
29     if (!n)
30         return -EAGAIN;
31     *n = *vma;
32     n->vm_end = end;
33     n->vm_flags = newflags;
34     n->vm_raend = 0;
35     if (n->vm_file)
36         get_file(n->vm_file);
37     if (n->vm_ops && n->vm_ops->open)
38         n->vm_ops->open(n);
39     vma->vm_pgoff += (end - vma->vm_start) >> PAGE_SHIFT;
40     lock_vma_mappings(vma);
41     spin_lock(&vma->vm_mm->page_table_lock);
42     vma->vm_start = end;
43     __insert_vm_struct(current->mm, n);
44     spin_unlock(&vma->vm_mm->page_table_lock);
45     unlock_vma_mappings(vma);
46     return 0;
47 }
28 Allocate a VMA from the slab allocator for the affected region

31-34 Copy in the necessary information

35-36 If the VMA has a file or device mapping, get_file() will increment the reference count

37-38 If an open() function is provided, call it

39 Update the offset within the file or device mapping for the old VMA to be the end of the locked region

40 lock_vma_mappings() will lock any files if this VMA is a shared region

41-44 Lock the parent mm_struct, update the old VMA's start to be the end of the affected region, insert the new VMA into the processes linked lists (See Section D.2.2.1) and release the lock

45 Unlock the file mappings with unlock_vma_mappings()

46 Return success
D.4.3.4  Function: mlock_fixup_end()  (mm/mlock.c)
Essentially the same as mlock_fixup_start() except the affected region is at the end of the VMA.

49 static inline int mlock_fixup_end(struct vm_area_struct * vma,
50     unsigned long start, int newflags)
51 {
52     struct vm_area_struct * n;
53
54     n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
55     if (!n)
56         return -EAGAIN;
57     *n = *vma;
58     n->vm_start = start;
59     n->vm_pgoff += (n->vm_start - vma->vm_start) >> PAGE_SHIFT;
60     n->vm_flags = newflags;
61     n->vm_raend = 0;
62     if (n->vm_file)
63         get_file(n->vm_file);
64     if (n->vm_ops && n->vm_ops->open)
65         n->vm_ops->open(n);
66     lock_vma_mappings(vma);
67     spin_lock(&vma->vm_mm->page_table_lock);
68     vma->vm_end = start;
69     __insert_vm_struct(current->mm, n);
70     spin_unlock(&vma->vm_mm->page_table_lock);
71     unlock_vma_mappings(vma);
72     return 0;
73 }
54 Allocate a VMA from the slab allocator for the affected region

57-61 Copy in the necessary information and update the offset within the file or device mapping

62-63 If the VMA has a file or device mapping, get_file() will increment the reference count

64-65 If an open() function is provided, call it

66 lock_vma_mappings() will lock any files if this VMA is a shared region

67-70 Lock the parent mm_struct, update the old VMA's end to be the start of the affected region, insert the new VMA into the processes linked lists (See Section D.2.2.1) and release the lock

71 Unlock the file mappings with unlock_vma_mappings()

72 Return success
D.4.3.5  Function: mlock_fixup_middle()  (mm/mlock.c)
Similar to the previous two fixup functions except that two new regions are required to fix up the mapping.

 75 static inline int mlock_fixup_middle(struct vm_area_struct * vma,
 76     unsigned long start, unsigned long end, int newflags)
 77 {
 78     struct vm_area_struct * left, * right;
 79
 80     left = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 81     if (!left)
 82         return -EAGAIN;
 83     right = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 84     if (!right) {
 85         kmem_cache_free(vm_area_cachep, left);
 86         return -EAGAIN;
 87     }
 88     *left = *vma;
 89     *right = *vma;
 90     left->vm_end = start;
 91     right->vm_start = end;
 92     right->vm_pgoff += (right->vm_start - left->vm_start) >>
                        PAGE_SHIFT;
 93     vma->vm_flags = newflags;
 94     left->vm_raend = 0;
 95     right->vm_raend = 0;
 96     if (vma->vm_file)
 97         atomic_add(2, &vma->vm_file->f_count);
 98
 99     if (vma->vm_ops && vma->vm_ops->open) {
100         vma->vm_ops->open(left);
101         vma->vm_ops->open(right);
102     }
103     vma->vm_raend = 0;
104     vma->vm_pgoff += (start - vma->vm_start) >> PAGE_SHIFT;
105     lock_vma_mappings(vma);
106     spin_lock(&vma->vm_mm->page_table_lock);
107     vma->vm_start = start;
108     vma->vm_end = end;
109     vma->vm_flags = newflags;
110     __insert_vm_struct(current->mm, left);
111     __insert_vm_struct(current->mm, right);
112     spin_unlock(&vma->vm_mm->page_table_lock);
113     unlock_vma_mappings(vma);
114     return 0;
115 }
80-87 Allocate the two new VMAs from the slab allocator

88-89 Copy in the information from the old VMA into them

90 The end of the left region is the start of the region to be affected

91 The start of the right region is the end of the affected region

92 Update the file offset

93 The old VMA is now the affected region so update its flags

94-95 Make the readahead window 0 to ensure pages not belonging to their regions are not accidentally read ahead

96-97 Increment the reference count to the file/device mapping if there is one

99-102 Call the open() function for the two new mappings

103-104 Cancel the readahead window and update the offset within the file to be the beginning of the locked region

105 Lock the shared file/device mappings

106-112 Lock the parent mm_struct, update the VMA and insert the two new regions into the process before releasing the lock again

113 Unlock the shared mappings

114 Return success
D.5  Page Faulting

Contents
D.5 Page Faulting                                         313
D.5.1 x86 Page Fault Handler                              313
D.5.1.1 Function: do_page_fault()                         313
D.5.2 Expanding the Stack                                 323
D.5.2.1 Function: expand_stack()                          323
D.5.3 Architecture Independent Page Fault Handler         324
D.5.3.1 Function: handle_mm_fault()                       324
D.5.3.2 Function: handle_pte_fault()                      326
D.5.4 Demand Allocation                                   327
D.5.4.1 Function: do_no_page()                            327
D.5.4.2 Function: do_anonymous_page()                     330
D.5.5 Demand Paging                                       332
D.5.5.1 Function: do_swap_page()                          332
D.5.5.2 Function: can_share_swap_page()                   336
D.5.5.3 Function: exclusive_swap_page()                   337
D.5.6 Copy On Write (COW) Pages                           338
D.5.6.1 Function: do_wp_page()                            338

This section deals with the page fault handler. It begins with the architecture specific function for the x86 and then moves to the architecture independent layer. The architecture specific functions all have the same responsibilities.
D.5.1  x86 Page Fault Handler

D.5.1.1  Function: do_page_fault()  (arch/i386/mm/fault.c)
The call graph for this function is shown in Figure 4.12. This is the x86 architecture-dependent handler for page fault exceptions. Each architecture registers its own handler but all of them have similar responsibilities.

140 asmlinkage void do_page_fault(struct pt_regs *regs,
                                  unsigned long error_code)
141 {
142     struct task_struct *tsk;
143     struct mm_struct *mm;
144     struct vm_area_struct * vma;
145     unsigned long address;
146     unsigned long page;
147     unsigned long fixup;
148     int write;
149     siginfo_t info;
150
151     /* get the address */
152     __asm__("movl %%cr2,%0":"=r" (address));
153
154     /* It's safe to allow irq's after cr2 has been saved */
155     if (regs->eflags & X86_EFLAGS_IF)
156         local_irq_enable();
157
158     tsk = current;
159

Function preamble. Get the fault address and enable interrupts.

140 The parameters are:

    regs is a struct containing the registers at fault time
    error_code indicates what sort of fault occurred

152 As the comment indicates, the cr2 register holds the fault address

155-156 If interrupts were enabled at the time of the fault, it is safe to re-enable them now that cr2 has been saved

158 Set the current task
173     if (address >= TASK_SIZE && !(error_code & 5))
174         goto vmalloc_fault;
175
176     mm = tsk->mm;
177     info.si_code = SEGV_MAPERR;
178
183     if (in_interrupt() || !mm)
184         goto no_context;
185

Check for exceptional faults: kernel faults, faults in interrupt context and faults with no memory context.

173 If the fault address is over TASK_SIZE, it is within the kernel address space. If, in addition, neither the protection bit (0) nor the user-mode bit (2) of the error code is set, this is a kernel-mode reference to a not-present kernel page, so handle it as a vmalloc fault

176 Record the working mm

183 If this is an interrupt, or there is no memory context (such as with a kernel thread), there is no way to safely handle the fault so goto no_context
186     down_read(&mm->mmap_sem);
187
188     vma = find_vma(mm, address);
189     if (!vma)
190         goto bad_area;
191     if (vma->vm_start <= address)
192         goto good_area;
193     if (!(vma->vm_flags & VM_GROWSDOWN))
194         goto bad_area;
195     if (error_code & 4) {
196         /*
197          * accessing the stack below %esp is always a bug.
198          * The "+ 32" is there due to some instructions (like
199          * pusha) doing post-decrement on the stack and that
200          * doesn't show up until later..
201          */
202         if (address + 32 < regs->esp)
203             goto bad_area;
204     }
205     if (expand_stack(vma, address))
206         goto bad_area;
If the fault occurred in userspace, find the VMA for the faulting address and determine if it is a good area, a bad area or if the fault occurred near a region that can be expanded, such as the stack.

186 Take the long lived mm semaphore

188 Find the VMA that is responsible or is closest to the faulting address

189-190 If a VMA does not exist at all, goto bad_area

191-192 If the start of the region is before the address, it means this VMA is the correct VMA for the fault so goto good_area which will check the permissions

193-194 For the region that is closest, check if it can grow down (VM_GROWSDOWN). If it does, it means the stack can probably be expanded. If not, goto bad_area

195-204 Check to make sure it isn't an access below the stack. If the error_code is 4, it means it is running in userspace

205-206 The stack is the only region with VM_GROWSDOWN set so if we reach here, the stack is expanded with expand_stack() (See Section D.5.2.1). If it fails, goto bad_area
211 good_area:
212     info.si_code = SEGV_ACCERR;
213     write = 0;
214     switch (error_code & 3) {
215         default:    /* 3: write, present */
216 #ifdef TEST_VERIFY_AREA
217             if (regs->cs == KERNEL_CS)
218                 printk("WP fault at %08lx\n", regs->eip);
219 #endif
220             /* fall through */
221         case 2:     /* write, not present */
222             if (!(vma->vm_flags & VM_WRITE))
223                 goto bad_area;
224             write++;
225             break;
226         case 1:     /* read, present */
227             goto bad_area;
228         case 0:     /* read, not present */
229             if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
230                 goto bad_area;
231     }
Here the first part of a good area is handled. The permissions need to be checked in case this is a protection fault.

212 By default, return an error

214 Check bits 0 and 1 of the error code. Bit 0 at 0 means the page was not present; at 1, it means a protection fault occurred, such as a write to a read-only area. Bit 1 is 0 if it was a read fault and 1 if a write

215 If it is 3, both bits are 1 so it is a write protection fault

221 Bit 1 is a 1 so it is a write fault

222-223 If the region can not be written to, it is a bad write so goto bad_area. If the region can be written to, this is a page that is marked Copy On Write (COW)

224 Flag that a write has occurred

226-227 This is a read and the page is present. There is no reason for the fault so it must be some other type of exception like a divide by zero; goto bad_area where it is handled

228-230 A read occurred on a missing page. Make sure it is ok to read or exec this page. If not, goto bad_area. The check for exec is made because the x86 can not exec protect a page and instead uses the read protect flag. This is why both have to be checked
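The meaning of the low bits of error_code is easy to misread, so the hedged helper below decodes them the same way the switch above does. It is a standalone illustration, not kernel code; only bits 0 (protection), 1 (write) and 2 (user mode) of the x86 error code are interpreted.

#include <stdio.h>

/* Decode the x86 page fault error code bits used by do_page_fault() */
static void decode_fault(unsigned long error_code)
{
        printf("%s, %s, %s mode\n",
               (error_code & 1) ? "protection fault" : "page not present",
               (error_code & 2) ? "write" : "read",
               (error_code & 4) ? "user" : "kernel");
}

int main(void)
{
        decode_fault(2);        /* page not present, write, kernel mode */
        decode_fault(7);        /* protection fault, write, user mode */
        return 0;
}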
233 survive:
239     switch (handle_mm_fault(mm, vma, address, write)) {
240     case 1:
241         tsk->min_flt++;
242         break;
243     case 2:
244         tsk->maj_flt++;
245         break;
246     case 0:
247         goto do_sigbus;
248     default:
249         goto out_of_memory;
250     }
251
252     /*
253      * Did it hit the DOS screen memory VA from vm86 mode?
254      */
255     if (regs->eflags & VM_MASK) {
256         unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
257         if (bit < 32)
258             tsk->thread.screen_bitmap |= 1 << bit;
259     }
260     up_read(&mm->mmap_sem);
261     return;
At this point, an attempt is going to be made to handle the fault gracefully with handle_mm_fault().

239 Call handle_mm_fault() with the relevant information about the fault. This is the architecture independent part of the handler

240-242 A return of 1 means it was a minor fault. Update statistics

243-245 A return of 2 means it was a major fault. Update statistics

246-247 A return of 0 means some IO error happened during the fault so go to the do_sigbus handler

248-249 Any other return means memory could not be allocated for the fault, so we are out of memory. In reality this does not happen as another function, out_of_memory(), is invoked in mm/oom_kill.c before this could happen and it is a lot more graceful about who it kills

260 Release the lock to the mm

261 Return as the fault has been successfully handled
267 bad_area:
268     up_read(&mm->mmap_sem);
269
270     /* User mode accesses just cause a SIGSEGV */
271     if (error_code & 4) {
272         tsk->thread.cr2 = address;
273         tsk->thread.error_code = error_code;
274         tsk->thread.trap_no = 14;
275         info.si_signo = SIGSEGV;
276         info.si_errno = 0;
277         /* info.si_code has been set above */
278         info.si_addr = (void *)address;
279         force_sig_info(SIGSEGV, &info, tsk);
280         return;
281     }
282
283     /*
284      * Pentium F0 0F C7 C8 bug workaround.
285      */
286     if (boot_cpu_data.f00f_bug) {
287         unsigned long nr;
288
289         nr = (address - idt) >> 3;
290
291         if (nr == 6) {
292             do_invalid_op(regs, 0);
293             return;
294         }
295     }
This is the bad area handler, for cases such as using memory with no vm_area_struct managing it. If the fault was not caused by a user process or the f00f bug, the no_context label is fallen through to.

271 An error code of 4 implies userspace so it is a simple case of sending a SIGSEGV to kill the process

272-274 Set thread information about what happened which can be read by a debugger later

275 Record that a SIGSEGV signal was sent

276 Clear errno

278 Record the address

279 Send the SIGSEGV signal. The process will exit and dump all the relevant information

280 Return as the fault has been successfully handled

286-295 A bug in the first Pentiums, called the f00f bug, caused the processor to constantly page fault. It was used as a local DoS attack on running Linux systems. The bug was trapped within a few hours and a patch released. Now it results in a harmless termination of the process rather than a rebooting system
296
297 no_context:
298     /* Are we prepared to handle this kernel fault? */
299     if ((fixup = search_exception_table(regs->eip)) != 0) {
300         regs->eip = fixup;
301         return;
302     }
299-302 Search the exception table with search_exception_table() to see if this exception can be handled and, if so, call the proper exception handler after returning. This is really important during copy_from_user() and copy_to_user() when an exception handler is specially installed to trap reads and writes to invalid regions in userspace without having to make expensive checks. It means that a small fixup block of code can be called rather than falling through to the next block which causes an oops
304 /*
305  * Oops. The kernel tried to access some bad page. We'll have to
306  * terminate things with extreme prejudice.
307  */
308
309     bust_spinlocks(1);
310
311     if (address < PAGE_SIZE)
312         printk(KERN_ALERT "Unable to handle kernel NULL pointer
                    dereference");
313     else
314         printk(KERN_ALERT "Unable to handle kernel paging
                    request");
315     printk(" at virtual address %08lx\n",address);
316     printk(" printing eip:\n");
317     printk("%08lx\n", regs->eip);
318     asm("movl %%cr3,%0":"=r" (page));
319     page = ((unsigned long *) __va(page))[address >> 22];
320     printk(KERN_ALERT "*pde = %08lx\n", page);
321     if (page & 1) {
322         page &= PAGE_MASK;
323         address &= 0x003ff000;
324         page = ((unsigned long *)
                    __va(page))[address >> PAGE_SHIFT];
325         printk(KERN_ALERT "*pte = %08lx\n", page);
326     }
327     die("Oops", regs, error_code);
328     bust_spinlocks(0);
329     do_exit(SIGKILL);
This is the no_context handler. Some bad exception occurred which, in all likelihood, is going to end with the process being terminated. Otherwise the kernel faulted when it definitely should not have and an oops report is generated.

309-329 The kernel faulted when it really shouldn't have and it is a kernel bug. This block generates an oops report

309 Forcibly free spinlocks which might prevent a message getting to console

311-312 If the address is < PAGE_SIZE, it means that a null pointer was used. Linux deliberately has page 0 unassigned to trap this type of fault which is a common programming error

313-314 Otherwise it is just some bad kernel error such as a driver trying to access userspace incorrectly

315-320 Print out information about the fault

321-326 Print out information about the page being faulted

327 Die and generate an oops report which can be used later to get a stack trace so a developer can see more accurately where and how the fault occurred

329 Forcibly kill the faulting process
335 out_of_memory:
336     if (tsk->pid == 1) {
337         yield();
338         goto survive;
339     }
340     up_read(&mm->mmap_sem);
341     printk("VM: killing process %s\n", tsk->comm);
342     if (error_code & 4)
343         do_exit(SIGKILL);
344     goto no_context;

The out of memory handler. It usually ends with the faulting process getting killed unless it is init.

336-339 If the process is init, just yield and goto survive which will try to handle the fault gracefully. init should never be killed

340 Free the mm semaphore

341 Print out a helpful "You are Dead" message

342 If from userspace, just kill the process

344 If in kernel space, go to the no_context handler which in this case will probably result in a kernel oops
345
346 do_sigbus:
347     up_read(&mm->mmap_sem);
348
353     tsk->thread.cr2 = address;
354     tsk->thread.error_code = error_code;
355     tsk->thread.trap_no = 14;
356     info.si_signo = SIGBUS;
357     info.si_errno = 0;
358     info.si_code = BUS_ADRERR;
359     info.si_addr = (void *)address;
360     force_sig_info(SIGBUS, &info, tsk);
361
362     /* Kernel mode? Handle exceptions or die */
363     if (!(error_code & 4))
364         goto no_context;
365     return;

347 Free the mm lock

353-359 Fill in information to show a SIGBUS occurred at the faulting address so that a debugger can trap it later

360 Send the signal

363-364 If in kernel mode, try and handle the exception at no_context

365 If in userspace, just return and the process will die in due course
367 vmalloc_fault:
368     {
376         int offset = __pgd_offset(address);
377         pgd_t *pgd, *pgd_k;
378         pmd_t *pmd, *pmd_k;
379         pte_t *pte_k;
380
381         asm("movl %%cr3,%0":"=r" (pgd));
382         pgd = offset + (pgd_t *)__va(pgd);
383         pgd_k = init_mm.pgd + offset;
384
385         if (!pgd_present(*pgd_k))
386             goto no_context;
387         set_pgd(pgd, *pgd_k);
388
389         pmd = pmd_offset(pgd, address);
390         pmd_k = pmd_offset(pgd_k, address);
391         if (!pmd_present(*pmd_k))
392             goto no_context;
393         set_pmd(pmd, *pmd_k);
394
395         pte_k = pte_offset(pmd_k, address);
396         if (!pte_present(*pte_k))
397             goto no_context;
398         return;
399     }
400 }
This is the vmalloc fault handler. When pages are mapped in the vmalloc space, only the reference page table is updated. As each process references this area, a fault will be trapped and the process page tables will be synchronised with the reference page table here.

376 Get the offset within a PGD

381 Copy the address of the PGD for the process from the cr3 register to pgd

382 Calculate the pgd pointer from the process PGD

383 Calculate it for the kernel reference PGD

385-386 If the pgd entry is invalid for the kernel page table, goto no_context

387 Set the page table entry in the process page table with a copy from the kernel reference page table

389-393 Same idea for the PMD. Copy the page table entry from the kernel reference page table to the process page tables

395 Check the PTE

396-397 If it is not present, it means the page was not valid even in the kernel reference page table, so goto no_context to handle what is probably a kernel bug, likely a reference to a random part of unused kernel space

398 Otherwise return knowing the process page tables have been updated and are in sync with the kernel page tables
D.5.2  Expanding the Stack

D.5.2.1  Function: expand_stack()  (include/linux/mm.h)
This function is called by the architecture dependent page fault handler. The VMA supplied is guaranteed to be one that can grow to cover the address.

640 static inline int expand_stack(struct vm_area_struct * vma,
                                   unsigned long address)
641 {
642     unsigned long grow;
643
644     /*
645      * vma->vm_start/vm_end cannot change under us because
         * the caller is required
646      * to hold the mmap_sem in write mode. We need to get the
647      * spinlock only before relocating the vma range ourself.
648      */
649     address &= PAGE_MASK;
650     spin_lock(&vma->vm_mm->page_table_lock);
651     grow = (vma->vm_start - address) >> PAGE_SHIFT;
652     if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
653         ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
                current->rlim[RLIMIT_AS].rlim_cur) {
654         spin_unlock(&vma->vm_mm->page_table_lock);
655         return -ENOMEM;
656     }
657     vma->vm_start = address;
658     vma->vm_pgoff -= grow;
659     vma->vm_mm->total_vm += grow;
660     if (vma->vm_flags & VM_LOCKED)
661         vma->vm_mm->locked_vm += grow;
662     spin_unlock(&vma->vm_mm->page_table_lock);
663     return 0;
664 }
649 Round the address down to the nearest page boundary

650 Lock the page tables spinlock

651 Calculate how many pages the stack needs to grow by

652 Check to make sure that the size of the stack does not exceed the process limits

653 Check to make sure that the size of the address space will not exceed process limits after the stack is grown

654-655 If either of the limits is reached, return -ENOMEM which will cause the faulting process to segfault

657-658 Grow the VMA down

659 Update the amount of address space used by the process

660-661 If the region is locked, update the number of locked pages used by the process

662-663 Unlock the process page tables and return success
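The two limits tested on lines 652-653 are the per-process RLIMIT_STACK and RLIMIT_AS values. The hedged userspace snippet below simply reports them, which is a quick way to see the ceiling expand_stack() will enforce for the current process; it is illustrative only.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit stack, as;

        getrlimit(RLIMIT_STACK, &stack);   /* limit checked on line 652 */
        getrlimit(RLIMIT_AS, &as);         /* limit checked on line 653 */

        printf("RLIMIT_STACK cur=%lu max=%lu\n",
               (unsigned long)stack.rlim_cur, (unsigned long)stack.rlim_max);
        printf("RLIMIT_AS    cur=%lu max=%lu\n",
               (unsigned long)as.rlim_cur, (unsigned long)as.rlim_max);
        return 0;
}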
D.5.3  Architecture Independent Page Fault Handler

This is the top level pair of functions for the architecture independent page fault handler.

D.5.3.1  Function: handle_mm_fault()  (mm/memory.c)
The call graph for this function is shown in Figure 4.14. This function allocates the PMD and PTE necessary for the new PTE that is about to be allocated. It takes the necessary locks to protect the page tables before calling handle_pte_fault() to fault in the page itself.

1364 int handle_mm_fault(struct mm_struct *mm,
            struct vm_area_struct * vma,
1365        unsigned long address, int write_access)
1366 {
1367     pgd_t *pgd;
1368     pmd_t *pmd;
1369
1370     current->state = TASK_RUNNING;
1371     pgd = pgd_offset(mm, address);
1372
1373     /*
1374      * We need the page table lock to synchronize with kswapd
1375      * and the SMP-safe atomic PTE updates.
1376      */
1377     spin_lock(&mm->page_table_lock);
1378     pmd = pmd_alloc(mm, pgd, address);
1379
1380     if (pmd) {
1381         pte_t * pte = pte_alloc(mm, pmd, address);
1382         if (pte)
1383             return handle_pte_fault(mm, vma, address,
                        write_access, pte);
1384     }
1385     spin_unlock(&mm->page_table_lock);
1386     return -1;
1387 }
1364 The parameters of the function are:

    mm is the mm_struct for the faulting process
    vma is the vm_area_struct managing the region the fault occurred in
    address is the faulting address
    write_access is 1 if the fault is a write fault

1370 Set the current state of the process

1371 Get the pgd entry from the top level page table

1377 Lock the mm_struct as the page tables will change

1378 pmd_alloc() will allocate a pmd_t if one does not already exist

1380 If the pmd has been successfully allocated then...

1381 Allocate a PTE for this address if one does not already exist

1382-1383 Handle the page fault with handle_pte_fault() (See Section D.5.3.2) and return the status code

1385 Failure path, unlock the mm_struct

1386 Return -1 which will be interpreted as an out of memory condition. This is correct as this line is only reached if a PMD or PTE could not be allocated
D.5.3.2  Function: handle_pte_fault()  (mm/memory.c)
This function decides what type of fault this is and which function should handle it. do_no_page() is called if this is the first time a page is to be allocated. do_swap_page() handles the case where the page was swapped out to disk, with the exception of pages swapped out from tmpfs. do_wp_page() breaks COW pages. If none of them are appropriate, the PTE entry is simply updated. If it was written to, it is marked dirty, and it is marked accessed to show it is a young page.

1331 static inline int handle_pte_fault(struct mm_struct *mm,
1332     struct vm_area_struct * vma, unsigned long address,
1333     int write_access, pte_t * pte)
1334 {
1335     pte_t entry;
1336
1337     entry = *pte;
1338     if (!pte_present(entry)) {
1339         /*
1340          * If it truly wasn't present, we know that kswapd
1341          * and the PTE updates will not touch it later. So
1342          * drop the lock.
1343          */
1344         if (pte_none(entry))
1345             return do_no_page(mm, vma, address,
                        write_access, pte);
1346         return do_swap_page(mm, vma, address, pte, entry,
                        write_access);
1347     }
1348
1349     if (write_access) {
1350         if (!pte_write(entry))
1351             return do_wp_page(mm, vma, address, pte, entry);
1352
1353         entry = pte_mkdirty(entry);
1354     }
1355     entry = pte_mkyoung(entry);
1356     establish_pte(vma, address, pte, entry);
1357     spin_unlock(&mm->page_table_lock);
1358     return 1;
1359 }
1331 The parameters of the function are the same as those for handle_mm_fault() except the PTE for the fault is included

1337 Record the PTE

1338 Handle the case where the PTE is not present

1344 If the PTE has never been filled, handle the allocation of the PTE with do_no_page() (See Section D.5.4.1)

1346 If the page has been swapped out to backing storage, handle it with do_swap_page() (See Section D.5.5.1)

1349-1354 Handle the case where the page is being written to

1350-1351 If the PTE is write protected, it is a COW page so handle it with do_wp_page() (See Section D.5.6.1)

1353 Otherwise just simply mark the page as dirty

1355 Mark the page as accessed

1356 establish_pte() copies the PTE and then updates the TLB and MMU cache. This does not copy in a new PTE but some architectures require the TLB and MMU update

1357 Unlock the mm_struct and return that a minor fault occurred
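The decision tree above reduces to two PTE predicates (present? empty?) plus the write_access flag. The hedged sketch below is not kernel code; it encodes the same dispatch over a toy PTE so the routing to do_no_page(), do_swap_page(), do_wp_page() or a plain PTE update can be traced. All names are invented.

#include <stdio.h>

struct toy_pte { int present; int swap_entry; int writable; };

static const char *route_fault(struct toy_pte pte, int write_access)
{
        if (!pte.present) {
                /* Never used at all versus swapped out to disk */
                if (!pte.swap_entry)
                        return "do_no_page";
                return "do_swap_page";
        }
        if (write_access && !pte.writable)
                return "do_wp_page";            /* break COW */
        return "mark young/dirty and update the PTE";
}

int main(void)
{
        struct toy_pte never  = { 0, 0, 0 };
        struct toy_pte onswap = { 0, 1, 0 };
        struct toy_pte cow    = { 1, 0, 0 };

        printf("%s\n", route_fault(never, 0));
        printf("%s\n", route_fault(onswap, 0));
        printf("%s\n", route_fault(cow, 1));
        return 0;
}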
D.5.4  Demand Allocation

D.5.4.1  Function: do_no_page()  (mm/memory.c)
The call graph for this function is shown in Figure 4.15. This function is called the first time a page is referenced so that it may be allocated and filled with data if necessary. If it is an anonymous page, determined by the lack of vm_ops available to the VMA or the lack of a nopage() function, then do_anonymous_page() is called. Otherwise the supplied nopage() function is called to allocate a page and it is inserted into the page tables here. The function has the following tasks:

• Check if do_anonymous_page() should be used and, if so, call it and return the page it allocates. If not, call the supplied nopage() function and ensure it allocates a page successfully.

• Break COW early if appropriate.

• Add the page to the page table entries and call the appropriate architecture dependent hooks.

1245 static int do_no_page(struct mm_struct * mm,
            struct vm_area_struct * vma,
1246        unsigned long address, int write_access, pte_t *page_table)
1247 {
1248     struct page * new_page;
1249     pte_t entry;
1250
1251     if (!vma->vm_ops || !vma->vm_ops->nopage)
1252         return do_anonymous_page(mm, vma, page_table,
                        write_access, address);
1253     spin_unlock(&mm->page_table_lock);
1254
1255     new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
1256
1257     if (new_page == NULL) /* no page was available -- SIGBUS */
1258         return 0;
1259     if (new_page == NOPAGE_OOM)
1260         return -1;
1245 The parameters supplied are the same as those for handle_pte_fault()

1251-1252 If no vm_ops is supplied or no nopage() function is supplied, then call do_anonymous_page() (See Section D.5.4.2) to allocate a page and return it

1253 Otherwise free the page table lock as the nopage() function can not be called with spinlocks held

1255 Call the supplied nopage function. In the case of filesystems, this is frequently filemap_nopage() (See Section D.6.4.1) but it will be different for each device driver

1257-1258 If NULL is returned, it means some error occurred in the nopage function such as an IO error while reading from disk. In this case, 0 is returned which results in a SIGBUS being sent to the faulting process

1259-1260 If NOPAGE_OOM is returned, the physical page allocator failed to allocate a page and -1 is returned which will forcibly kill the process
1265     if (write_access && !(vma->vm_flags & VM_SHARED)) {
1266         struct page * page = alloc_page(GFP_HIGHUSER);
1267         if (!page) {
1268             page_cache_release(new_page);
1269             return -1;
1270         }
1271         copy_user_highpage(page, new_page, address);
1272         page_cache_release(new_page);
1273         lru_cache_add(page);
1274         new_page = page;
1275     }
Break COW early in this block if appropriate. COW is broken if the fault is a write fault and the region is not shared with VM_SHARED. If COW was not broken in this case, a second fault would occur immediately upon return.

1265 Check if COW should be broken early

1266 If so, allocate a new page for the process

1267-1270 If the page could not be allocated, reduce the reference count to the page returned by the nopage() function and return -1 for out of memory

1271 Otherwise copy the contents

1272 Reduce the reference count to the returned page which may still be in use by another process

1273 Add the new page to the LRU lists so it may be reclaimed by kswapd later
1277     spin_lock(&mm->page_table_lock);
1288     /* Only go through if we didn't race with anybody else... */
1289     if (pte_none(*page_table)) {
1290         ++mm->rss;
1291         flush_page_to_ram(new_page);
1292         flush_icache_page(vma, new_page);
1293         entry = mk_pte(new_page, vma->vm_page_prot);
1294         if (write_access)
1295             entry = pte_mkwrite(pte_mkdirty(entry));
1296         set_pte(page_table, entry);
1297     } else {
1298         /* One of our sibling threads was faster, back out. */
1299         page_cache_release(new_page);
1300         spin_unlock(&mm->page_table_lock);
1301         return 1;
1302     }
1303
1304     /* no need to invalidate: a not-present page shouldn't
          * be cached
          */
1305     update_mmu_cache(vma, address, entry);
1306     spin_unlock(&mm->page_table_lock);
1307     return 2;    /* Major fault */
1308 }
1277 Lock the page tables again as the allocations have finished and the page tables are about to be updated

1289 Check if there is still no PTE in the entry we are about to use. If two faults hit here at the same time, it is possible another processor has already completed the page fault and this one should be backed out

1290-1297 If there is no PTE entered, complete the fault

1290 Increase the RSS count as the process is now using another page. A check really should be made here to make sure it isn't the global zero page as the RSS count could be misleading

1291 As the page is about to be mapped into the process space, it is possible on some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the CPU cache will be coherent

1292 flush_icache_page() is similar in principle except it ensures the icache and dcache are coherent

1293 Create a pte_t with the appropriate permissions

1294-1295 If this is a write, then make sure the PTE has write permissions

1296 Place the new PTE in the process page tables

1297-1302 If the PTE is already filled, the page acquired from the nopage() function must be released

1299 Decrement the reference count to the page. If it drops to 0, it will be freed

1300-1301 Release the mm_struct lock and return 1 to signal this is a minor page fault, as no major work had to be done for this fault since it was all done by the winner of the race

1305 Update the MMU cache for architectures that require it

1306-1307 Release the mm_struct lock and return 2 to signal this is a major page fault
D.5.4.2  Function: do_anonymous_page()  (mm/memory.c)
This function allocates a new page for a process accessing a page for the first time. If it is a read access, a system wide page containing only zeros is mapped into the process. If it is a write, a zero filled page is allocated and placed within the page tables.

1190 static int do_anonymous_page(struct mm_struct * mm,
            struct vm_area_struct * vma,
            pte_t *page_table, int write_access,
            unsigned long addr)
1191 {
1192     pte_t entry;
1193
1194     /* Read-only mapping of ZERO_PAGE. */
1195     entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr),
                                      vma->vm_page_prot));
1196
1197     /* ..except if it's a write access */
1198     if (write_access) {
1199         struct page *page;
1200
1201         /* Allocate our own private page. */
1202         spin_unlock(&mm->page_table_lock);
1203
1204         page = alloc_page(GFP_HIGHUSER);
1205         if (!page)
1206             goto no_mem;
1207         clear_user_highpage(page, addr);
1208
1209         spin_lock(&mm->page_table_lock);
1210         if (!pte_none(*page_table)) {
1211             page_cache_release(page);
1212             spin_unlock(&mm->page_table_lock);
1213             return 1;
1214         }
1215         mm->rss++;
1216         flush_page_to_ram(page);
1217         entry = pte_mkwrite(
                    pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
1218         lru_cache_add(page);
1219         mark_page_accessed(page);
1220     }
1221
1222     set_pte(page_table, entry);
1223
1224     /* No need to invalidate - it was non-present before */
1225     update_mmu_cache(vma, addr, entry);
1226     spin_unlock(&mm->page_table_lock);
1227     return 1;    /* Minor fault */
1228
1229 no_mem:
1230     return -1;
1231 }
1190
The parameters are the same as those passed to
handle_pte_fault()
(See Section D.5.3.2)
1195 For read accesses, simply map the system wide empty_zero_page which the
ZERO_PAGE()
macro returns with the given permissions.
The page is write
protected so that a write to the page will result in a page fault
1198-1220 If this is a write fault, then allocate a new page and zero fill it
1202 Unlock the mm_struct as the allocation of a new page could sleep
1204 Allocate a new page
1205 If a page could not be allocated, return -1 to handle the OOM situation
1207 Zero fill the page
1209 Reacquire the lock as the page tables are to be updated
1215
Update the RSS for the process. Note that the RSS is not updated if it is
the global zero page being mapped as is the case with the read-only fault at
line 1195
1216 Ensure the cache is coherent
1217 Mark the PTE writable and dirty as it has been written to
1218 Add the page to the LRU list so it may be reclaimed by the swapper later
1219 Mark the page accessed which ensures the page is marked hot and on the top
of the active list
1222 Fix the PTE in the page tables for this process
1225 Update the MMU cache if the architecture needs it
1226 Free the page table lock
1227 Return as a minor fault as even though it is possible the page allocator spent time writing out pages, data did not have to be read from disk to fill this page
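The behaviour just described can be observed from user space. The short stand-alone program below is not from the kernel source; the mapping size and messages are arbitrary. Reading the mapping first causes do_anonymous_page() to map the zero page read-only, and the later write forces a private zero-filled page to be allocated.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;

        /* Anonymous private mapping: no physical page is allocated yet */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Read fault: the global zero page is mapped write-protected */
        printf("Byte before write: %d\n", buf[0]);

        /* Write fault: a private zero-filled page replaces the zero page */
        memset(buf, 'x', len);
        printf("Byte after write:  %c\n", buf[0]);

        munmap(buf, len);
        return 0;
    }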
D.5.5 Demand Paging
D.5.5.1
Function: do_swap_page()
(mm/memory.c)
The call graph for this function is shown in Figure 4.16. This function handles the case where a page has been swapped out. A swapped out page may exist in the swap cache if it is shared between a number of processes or recently swapped in during readahead. This function is broken up into three parts:
• Search for the page in the swap cache
• If it does not exist, call swapin_readahead() to read in the page
• Insert the page into the process page tables
1117 static int do_swap_page(struct mm_struct * mm,
1118     struct vm_area_struct * vma, unsigned long address,
1119     pte_t * page_table, pte_t orig_pte, int write_access)
1120 {
1121     struct page *page;
1122     swp_entry_t entry = pte_to_swp_entry(orig_pte);
1123     pte_t pte;
1124     int ret = 1;
1125
1126     spin_unlock(&mm->page_table_lock);
1127     page = lookup_swap_cache(entry);
Function preamble, check for the page in the swap cache
1117-1119 The parameters are the same as those supplied to handle_pte_fault()
(See Section D.5.3.2)
1122 Get the swap entry information from the PTE
1126 Free the mm_struct spinlock
1127 Lookup the page in the swap cache
1128     if (!page) {
1129         swapin_readahead(entry);
1130         page = read_swap_cache_async(entry);
1131         if (!page) {
1136             int retval;
1137             spin_lock(&mm->page_table_lock);
1138             retval = pte_same(*page_table, orig_pte) ? -1 : 1;
1139             spin_unlock(&mm->page_table_lock);
1140             return retval;
1141         }
1142
1143         /* Had to read the page from swap area: Major fault */
1144         ret = 2;
1145     }
If the page did not exist in the swap cache, then read it from backing storage with swapin_readahead() which reads in the requested pages and a number of pages after it. Once it completes, read_swap_cache_async() should be able to return the page.
1128-1145 This block is executed if the page was not in the swap cache
1129 swapin_readahead() (See Section D.6.6.1) reads in the requested page and a number of pages after it. The number of pages read in is determined by the page_cluster variable in mm/swap.c which is initialised to 2 on machines with less than 16MiB of memory and 3 otherwise. 2^page_cluster pages are read in after the requested page unless a bad or empty page entry is encountered
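As a sketch of the arithmetic only (this is not kernel code; the 16MiB threshold mirrors the rule described above and num_physpages is a stand-in for the kernel variable of the same name):

    #include <stdio.h>

    /* Sketch: how many pages a swap readahead cluster covers, following
     * the page_cluster rule described above (assuming 4KiB pages) */
    static unsigned long cluster_pages(unsigned long num_physpages)
    {
        int page_cluster = num_physpages < (16 * 1024 * 1024) / 4096 ? 2 : 3;
        return 1UL << page_cluster;    /* 2^page_cluster pages */
    }

    int main(void)
    {
        printf("8MiB machine:  %lu pages per cluster\n", cluster_pages(2048));
        printf("64MiB machine: %lu pages per cluster\n", cluster_pages(16384));
        return 0;
    }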
1130 read_swap_cache_async() (See Section K.3.1.1) will look up the requested
page and read it from disk if necessary
1131-1141
If the page does not exist, there was another fault which swapped in
this page and removed it from the cache while spinlocks were dropped
1137 Lock the mm_struct
1138 Compare the two PTEs.
If they do not match, -1 is returned to signal an IO
error, else 1 is returned to mark a minor page fault as a disk access was not
required for this particular page.
1139-1140 Free the mm_struct and return the status
1144 The disk had to be accessed to mark that this is a major page fault
1147     mark_page_accessed(page);
1148
1149     lock_page(page);
1150
1151     /*
1152      * Back out if somebody else faulted in this pte while we
1153      * released the page table lock.
1154      */
1155     spin_lock(&mm->page_table_lock);
1156     if (!pte_same(*page_table, orig_pte)) {
1157         spin_unlock(&mm->page_table_lock);
1158         unlock_page(page);
1159         page_cache_release(page);
1160         return 1;
1161     }
1162
1163     /* The page isn't present yet, go ahead with the fault. */
1164
1165     swap_free(entry);
1166     if (vm_swap_full())
1167         remove_exclusive_swap_page(page);
1168
1169     mm->rss++;
1170     pte = mk_pte(page, vma->vm_page_prot);
1171     if (write_access && can_share_swap_page(page))
1172         pte = pte_mkdirty(pte_mkwrite(pte));
1173     unlock_page(page);
1174
1175     flush_page_to_ram(page);
1176     flush_icache_page(vma, page);
1177     set_pte(page_table, pte);
1178
1179     /* No need to invalidate - it was non-present before */
1180     update_mmu_cache(vma, address, pte);
1181     spin_unlock(&mm->page_table_lock);
1182     return ret;
1183 }
Place the page in the process page tables
1147 mark_page_accessed()(See Section J.2.3.1) will mark the page as active so
it will be moved to the top of the active LRU list
1149 Lock the page which has the side effect of waiting for the IO swapping in the
page to complete
1155-1161
If someone else faulted in the page before we could, the reference to
the page is dropped, the lock freed and return that this was a minor fault
1165
The function
swap_free()(See
Section K.2.2.1) reduces the reference to a
swap entry. If it drops to 0, it is actually freed
1166-1167 Page slots in swap space are reserved for the same page once they have
been swapped out to avoid having to search for a free slot each time. If the
swap space is full though, the reservation is broken and the slot freed up for
another page
1169 The page is now going to be used so increment the mm_structs RSS count
1170 Make a PTE for this page
1171 If the page is being written to and is not shared between more than one
process, mark it dirty so that it will be kept in sync with the backing storage
and swap cache for other processes
1173 Unlock the page
1175
As the page is about to be mapped to the process space, it is possible for
some architectures that writes to the page in kernel space will not be visible
to the process.
flush_page_to_ram()
ensures the cache will be coherent
1176 flush_icache_page() is similar in principle except it ensures the icache and
dcache's are coherent
1177 Set the PTE in the process page tables
1180 Update the MMU cache if the architecture requires it
1181-1182
Unlock the
mm_struct
and return whether it was a minor or major
page fault
D.5.5.2 Function: can_share_swap_page() (mm/swapfile.c)
This function determines if the swap cache entry for this page may be used or not. It may be used if there are no other references to it. Most of the work is performed by exclusive_swap_page() but this function first makes a few basic checks to avoid having to acquire too many locks.
259 int can_share_swap_page(struct page *page)
260 {
261     int retval = 0;
262
263     if (!PageLocked(page))
264         BUG();
265     switch (page_count(page)) {
266     case 3:
267         if (!page->buffers)
268             break;
269         /* Fallthrough */
270     case 2:
271         if (!PageSwapCache(page))
272             break;
273         retval = exclusive_swap_page(page);
274         break;
275     case 1:
276         if (PageReserved(page))
277             break;
278         retval = 1;
279     }
280     return retval;
281 }
263-264 This function is called from the fault path and the page must be locked
265 Switch based on the number of references
266-268 If the count is 3, but there are no buffers associated with it, there is more than one process using the page. Buffers may be associated for just one process if the page is backed by a swap file instead of a partition
270-273
If the count is only two, but it is not a member of the swap cache, then
it has no slot which may be shared so return false. Otherwise perform a full
check with
exclusive_swap_page()
(See Section D.5.5.3)
276-277 If the page is reserved, it is the global ZERO_PAGE so it cannot be shared. Otherwise, this page is definitely the only one
D.5.5.3 Function: exclusive_swap_page() (mm/swapfile.c)
This function checks if the process is the only user of a locked swap page.
229 static int exclusive_swap_page(struct page *page)
230 {
231     int retval = 0;
232     struct swap_info_struct * p;
233     swp_entry_t entry;
234
235     entry.val = page->index;
236     p = swap_info_get(entry);
237     if (p) {
238         /* Is the only swap cache user the cache itself? */
239         if (p->swap_map[SWP_OFFSET(entry)] == 1) {
240             /* Recheck the page count with the pagecache
                 * lock held.. */
241             spin_lock(&pagecache_lock);
242             if (page_count(page) - !!page->buffers == 2)
243                 retval = 1;
244             spin_unlock(&pagecache_lock);
245         }
246         swap_info_put(p);
247     }
248     return retval;
249 }
231 By default, return false
235 The swp_entry_t for the page is stored in page→index as explained in Section
2.4
236 Get the swap_info_struct with swap_info_get()(See Section K.2.3.1)
237-247 If a slot exists, check if we are the exclusive user and return true if we are
239 Check if the slot is only being used by the cache itself. If it is, the page count needs to be checked again with the pagecache_lock held
242-243 !!page→buffers will evaluate to 1 if buffers are present so this block effectively checks if the process is the only user of the page. If it is, retval is set to 1 so that true will be returned
246 Drop the reference to the slot that was taken with swap_info_get() (See Section K.2.3.1)
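The !! idiom used at line 242 can look odd on first reading; the tiny user-space snippet below (not kernel code) shows that it simply collapses a pointer to 0 or 1, which is what allows the buffer reference to be subtracted from the page count.

    #include <stdio.h>

    int main(void)
    {
        void *buffers = NULL;

        printf("!!buffers with no buffers:  %d\n", !!buffers);  /* prints 0 */
        buffers = &buffers;
        printf("!!buffers with buffers set: %d\n", !!buffers);  /* prints 1 */
        return 0;
    }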
D.5.6 Copy On Write (COW) Pages
D.5.6.1 Function: do_wp_page() (mm/memory.c)
The call graph for this function is shown in Figure 4.17. This function handles the case where a user tries to write to a private page shared among processes, such as what happens after fork(). Basically what happens is a page is allocated, the contents copied to the new page and the shared count decremented in the old page.
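The effect is easy to observe from user space. In the stand-alone program below (not from the kernel source, strings are arbitrary), parent and child share the heap page after fork() until the child's write triggers do_wp_page() and gives it a private copy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char *buf = malloc(4096);
        pid_t pid;

        strcpy(buf, "parent data");

        pid = fork();
        if (pid == 0) {
            /* Child write: COW is broken and the child gets its own page */
            strcpy(buf, "child data");
            printf("child sees:  %s\n", buf);
            exit(0);
        }
        waitpid(pid, NULL, 0);

        /* The parent's copy is unaffected by the child's write */
        printf("parent sees: %s\n", buf);
        free(buf);
        return 0;
    }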
948 static int do_wp_page(struct mm_struct *mm,
        struct vm_area_struct * vma,
949     unsigned long address, pte_t *page_table, pte_t pte)
950 {
951     struct page *old_page, *new_page;
952
953     old_page = pte_page(pte);
954     if (!VALID_PAGE(old_page))
955         goto bad_wp_page;
956
948-950 The parameters are the same as those supplied to handle_pte_fault()
953-955 Get a reference to the current page in the PTE and make sure it is valid
957     if (!TryLockPage(old_page)) {
958         int reuse = can_share_swap_page(old_page);
959         unlock_page(old_page);
960         if (reuse) {
961             flush_cache_page(vma, address);
962             establish_pte(vma, address, page_table,
                    pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
963             spin_unlock(&mm->page_table_lock);
964             return 1;    /* Minor fault */
965         }
966     }
957 First try to lock the page. If 0 is returned, it means the page was previously unlocked
958 If we managed to lock it, call can_share_swap_page() (See Section D.5.5.2) to see if we are the exclusive user of the swap slot for this page. If we are, it means that we are the last process to break COW and we can simply use this page rather than allocating a new one
960-965
If we are the only users of the swap slot, then it means we are the only
user of this page and the last process to break COW so the PTE is simply
re-established and we return a minor fault
968     /*
969      * Ok, we need to copy. Oh, well..
970      */
971     page_cache_get(old_page);
972     spin_unlock(&mm->page_table_lock);
973
974     new_page = alloc_page(GFP_HIGHUSER);
975     if (!new_page)
976         goto no_mem;
977     copy_cow_page(old_page,new_page,address);
978
971 We need to copy this page so first get a reference to the old page so it doesn't disappear before we are finished with it
972 Unlock the spinlock as we are about to call alloc_page() (See Section F.2.1)
which may sleep
974-976 Allocate a page and make sure one was returned
977 No prizes for guessing what this function does. If the page being broken is the global zero page, clear_user_highpage() will be used to zero out the contents of the page, otherwise copy_user_highpage() copies the actual contents
982     spin_lock(&mm->page_table_lock);
983     if (pte_same(*page_table, pte)) {
984         if (PageReserved(old_page))
985             ++mm->rss;
986         break_cow(vma, new_page, address, page_table);
987         lru_cache_add(new_page);
988
989         /* Free the old page.. */
990         new_page = old_page;
991     }
992     spin_unlock(&mm->page_table_lock);
993     page_cache_release(new_page);
994     page_cache_release(old_page);
995     return 1;    /* Minor fault */
982 The page table lock was released for alloc_page()(See Section F.2.1) so reacquire it
983 Make sure the PTE hasn't changed in the meantime which could have happened if another fault occurred while the spinlock was released
984-985 The RSS is only updated if PageReserved() is true which will only happen if the page being faulted is the global ZERO_PAGE which is not accounted for in the RSS. If this was a normal page, the process would be using the same number of physical frames after the fault as it was before but against the zero page, it will be using a new frame, hence the rss++
986 break_cow() is responsible for calling the architecture hooks to ensure the CPU cache and TLBs are up to date and then establish the new page into the PTE. It first calls flush_page_to_ram() which must be called when a struct page is about to be placed in userspace. Next is flush_cache_page() which flushes the page from the CPU cache. Lastly is establish_pte() which establishes the new page into the PTE
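For reference, break_cow() itself is only a few lines in mm/memory.c; the sketch below shows its general shape and follows the order of calls just described, but treat it as an approximation rather than a verbatim excerpt from the commented kernel version.

    static inline void break_cow(struct vm_area_struct * vma,
            struct page * new_page, unsigned long address,
            pte_t *page_table)
    {
        /* Make the kernel-side writes visible before mapping to userspace */
        flush_page_to_ram(new_page);
        /* Flush the old mapping from the CPU cache */
        flush_cache_page(vma, address);
        /* Install the new page, writable and dirty */
        establish_pte(vma, address, page_table,
            pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
    }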
987 Add the page to the LRU lists
992 Release the spinlock
993-994 Drop the references to the pages
995 Return a minor fault
996
997 bad_wp_page:
998     spin_unlock(&mm->page_table_lock);
999     printk("do_wp_page: bogus page at address %08lx (page 0x%lx)\n",
            address,(unsigned long)old_page);
1000    return -1;
1001 no_mem:
1002    page_cache_release(old_page);
1003    return -1;
1004 }
997-1000 This is a false COW break which will only happen with a buggy kernel.
Print out an informational message and return
1001-1003 The page allocation failed so release the reference to the old page and
return -1
D.6 Page-Related Disk IO
Contents
D.6 Page-Related Disk IO 341
D.6.1 Generic File Reading 341
D.6.1.1 Function: generic_file_read() 341
D.6.1.2 Function: do_generic_file_read() 344
D.6.1.3 Function: generic_file_readahead() 351
D.6.2 Generic File mmap() 355
D.6.2.1 Function: generic_file_mmap() 355
D.6.3 Generic File Truncation 356
D.6.3.1 Function: vmtruncate() 356
D.6.3.2 Function: vmtruncate_list() 358
D.6.3.3 Function: zap_page_range() 359
D.6.3.4 Function: zap_pmd_range() 361
D.6.3.5 Function: zap_pte_range() 362
D.6.3.6 Function: truncate_inode_pages() 364
D.6.3.7 Function: truncate_list_pages() 365
D.6.3.8 Function: truncate_complete_page() 367
D.6.3.9 Function: do_flushpage() 368
D.6.3.10 Function: truncate_partial_page() 368
D.6.4 Reading Pages for the Page Cache 369
D.6.4.1 Function: filemap_nopage() 369
D.6.4.2 Function: page_cache_read() 374
D.6.5 File Readahead for nopage() 375
D.6.5.1 Function: nopage_sequential_readahead() 375
D.6.5.2 Function: read_cluster_nonblocking() 377
D.6.6 Swap Related Read-Ahead 378
D.6.6.1 Function: swapin_readahead() 378
D.6.6.2 Function: valid_swaphandles() 379
D.6.1 Generic File Reading
This is more the domain of the IO manager than the VM but because it performs the operations via the page cache, we will cover it briefly. The operation of generic_file_write() is essentially the same although it is not covered by this book. However, if you understand how the read takes place, the write function will pose no problem to you.
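To put this in context, a filesystem picks up these generic helpers simply by pointing its struct file_operations at them. The sketch below is for an imaginary filesystem (the structure name is made up), but the style matches how filesystems such as ext2 wire up the generic routines in 2.4.

    /* Sketch: wiring the generic page cache helpers into a filesystem */
    static struct file_operations example_file_operations = {
        llseek:     generic_file_llseek,
        read:       generic_file_read,     /* Section D.6.1.1 */
        write:      generic_file_write,
        mmap:       generic_file_mmap,     /* Section D.6.2.1 */
    };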
D.6.1.1 Function: generic_file_read() (mm/filemap.c)
This is the generic file read function used by any filesystem that reads pages through the page cache. For normal IO, it is responsible for building a read_descriptor_t for use with do_generic_file_read() and file_read_actor(). For direct IO, this function is basically a wrapper around generic_file_direct_IO().
1695 ssize_t generic_file_read(struct file * filp,
            char * buf, size_t count, loff_t *ppos)
1696 {
1697     ssize_t retval;
1698
1699     if ((ssize_t) count < 0)
1700         return -EINVAL;
1701
1702     if (filp->f_flags & O_DIRECT)
1703         goto o_direct;
1704
1705     retval = -EFAULT;
1706     if (access_ok(VERIFY_WRITE, buf, count)) {
1707         retval = 0;
1708
1709         if (count) {
1710             read_descriptor_t desc;
1711
1712             desc.written = 0;
1713             desc.count = count;
1714             desc.buf = buf;
1715             desc.error = 0;
1716             do_generic_file_read(filp, ppos, &desc,
                         file_read_actor);
1717
1718             retval = desc.written;
1719             if (!retval)
1720                 retval = desc.error;
1721         }
1722     }
1723  out:
1724     return retval;
This block is concerned with normal file IO.
1702-1703 If this is direct IO, jump to the o_direct label
1706 If the access permissions to write to a userspace page are ok, then proceed
1709 If count is 0, there is no IO to perform
1712-1715 Populate a read_descriptor_t structure which will be used by file_read_actor() (See Section L.3.2.3)
1716 Perform the file read
1718 Extract the number of bytes written from the read descriptor struct
1719-1720 If an error occurred, extract what the error was
1724 Return either the number of bytes read or the error that occurred
1725  o_direct:
1726     {
1727         loff_t pos = *ppos, size;
1728
1729         struct address_space *mapping =
                    filp->f_dentry->d_inode->i_mapping;
1730         struct inode *inode = mapping->host;
1731
1732         retval = 0;
1733         if (!count)
1734             goto out; /* skip atime */
1735         down_read(&inode->i_alloc_sem);
1736         down(&inode->i_sem);
1737         size = inode->i_size;
1738         if (pos < size) {
1739             retval = generic_file_direct_IO(READ, filp, buf,
                         count, pos);
1740             if (retval > 0)
1741                 *ppos = pos + retval;
1742         }
1743         UPDATE_ATIME(filp->f_dentry->d_inode);
1744         goto out;
1745     }
1746 }
This block is concerned with direct IO. It is largely responsible for extracting
the parameters required for
generic_file_direct_IO().
1729 Get the address_space used by this struct file
1733-1734 If no IO has been requested, jump to out to avoid updating the inodes
access time
1737 Get the size of the file
1738-1739 If the current position is before the end of the file, the read is safe so call generic_file_direct_IO()
1740-1741
If the read was successful, update the current position in the le for
the reader
1743 Update the access time
1744 Goto out which just returns retval
D.6.1.2 Function: do_generic_file_read() (mm/filemap.c)
This is the core part of the generic file read operation. It is responsible for allocating a page if it doesn't already exist in the page cache. If it does, it must make sure the page is up-to-date and finally, it is responsible for making sure that the appropriate readahead window is set.
1349 void do_generic_file_read(struct file * filp,
            loff_t *ppos,
            read_descriptor_t * desc,
            read_actor_t actor)
1350 {
1351     struct address_space *mapping =
                filp->f_dentry->d_inode->i_mapping;
1352     struct inode *inode = mapping->host;
1353     unsigned long index, offset;
1354     struct page *cached_page;
1355     int reada_ok;
1356     int error;
1357     int max_readahead = get_max_readahead(inode);
1358
1359     cached_page = NULL;
1360     index = *ppos >> PAGE_CACHE_SHIFT;
1361     offset = *ppos & ~PAGE_CACHE_MASK;
1362
1357 Get the maximum readahead window size for this block device
1360 Calculate the page index which holds the current file position pointer
1361 Calculate the offset within the page that holds the current file position pointer
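As a quick worked example of these two lines, assuming the common case of PAGE_CACHE_SHIFT being 12 (4KiB pages):

    /* Worked example only; 4096 stands in for PAGE_CACHE_SIZE */
    loff_t ppos = 10000;
    unsigned long index  = ppos >> 12;          /* 10000 / 4096 = 2    */
    unsigned long offset = ppos & (4096 - 1);   /* 10000 % 4096 = 1808 */
    /* i.e. the file position is byte 1808 of the third page in the file */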
1363 /*
1364  * If the current position is outside the previous read-ahead
1365  * window, we reset the current read-ahead context and set read
1366  * ahead max to zero (will be set to just needed value later),
1367  * otherwise, we assume that the file accesses are sequential
1368  * enough to continue read-ahead.
1369  */
1370     if (index > filp->f_raend ||
             index + filp->f_rawin < filp->f_raend) {
1371         reada_ok = 0;
1372         filp->f_raend = 0;
1373         filp->f_ralen = 0;
1374         filp->f_ramax = 0;
1375         filp->f_rawin = 0;
1376     } else {
1377         reada_ok = 1;
1378     }
1379 /*
1380  * Adjust the current value of read-ahead max.
1381  * If the read operation stay in the first half page, force no
1382  * readahead. Otherwise try to increase read ahead max just
      * enough to do the read request.
1383  * Then, at least MIN_READAHEAD if read ahead is ok,
1384  * and at most MAX_READAHEAD in all cases.
1385  */
1386     if (!index && offset + desc->count <= (PAGE_CACHE_SIZE >> 1)) {
1387         filp->f_ramax = 0;
1388     } else {
1389         unsigned long needed;
1390
1391         needed = ((offset + desc->count) >> PAGE_CACHE_SHIFT) + 1;
1392
1393         if (filp->f_ramax < needed)
1394             filp->f_ramax = needed;
1395
1396         if (reada_ok && filp->f_ramax < vm_min_readahead)
1397             filp->f_ramax = vm_min_readahead;
1398         if (filp->f_ramax > max_readahead)
1399             filp->f_ramax = max_readahead;
1400     }
1370-1378 As the comment suggests, the readahead window gets reset if the current file position is outside the current readahead window. It gets reset to 0 here and adjusted by generic_file_readahead() (See Section D.6.1.3) as necessary
1386-1400 As the comment states, the readahead window gets adjusted slightly if we are in the second-half of the current page
1402     for (;;) {
1403         struct page *page, **hash;
1404         unsigned long end_index, nr, ret;
1405
1406         end_index = inode->i_size >> PAGE_CACHE_SHIFT;
1407
1408         if (index > end_index)
1409             break;
1410         nr = PAGE_CACHE_SIZE;
1411         if (index == end_index) {
1412             nr = inode->i_size & ~PAGE_CACHE_MASK;
1413             if (nr <= offset)
1414                 break;
1415         }
1416
1417         nr = nr - offset;
1418
1419         /*
1420          * Try to find the data in the page cache..
1421          */
1422         hash = page_hash(mapping, index);
1423
1424         spin_lock(&pagecache_lock);
1425         page = __find_page_nolock(mapping, index, *hash);
1426         if (!page)
1427             goto no_cached_page;
1402 This loop goes through each of the pages necessary to satisfy the read request
1406 Calculate where the end of the le is in pages
1408-1409 If the current index is beyond the end, then break out as we are trying
to read beyond the end of the file
1410-1417
Calculate
nr
to be the number of bytes remaining to be read in the
current page. The block takes into account that this might be the last page
used by the file and where the current file position is within the page
1422-1425 Search for the page in the page cache
1426-1427
If the page is not in the page cache, goto
no_cached_page
where it
will be allocated
1428 found_page:
1429     page_cache_get(page);
1430     spin_unlock(&pagecache_lock);
1431
1432     if (!Page_Uptodate(page))
1433         goto page_not_up_to_date;
1434     generic_file_readahead(reada_ok, filp, inode, page);
In this block, the page was found in the page cache.
1429
Take a reference to the page in the page cache so it does not get freed
prematurely
1432-1433
If the page is not up-to-date, goto
page_not_up_to_date
to update
the page with information on the disk
1434 Perform le readahead with generic_file_readahead()(See Section D.6.1.3)
1435 page_ok:
1436     /* If users can be writing to this page using arbitrary
1437      * virtual addresses, take care about potential aliasing
1438      * before reading the page on the kernel side.
1439      */
1440     if (mapping->i_mmap_shared != NULL)
1441         flush_dcache_page(page);
1442
1443     /*
1444      * Mark the page accessed if we read the
1445      * beginning or we just did an lseek.
1446      */
1447     if (!offset || !filp->f_reada)
1448         mark_page_accessed(page);
1449
1450     /*
1451      * Ok, we have the page, and it's up-to-date, so
1452      * now we can copy it to user space...
1453      *
1454      * The actor routine returns how many bytes were actually used..
1455      * NOTE! This may not be the same as how much of a user buffer
1456      * we filled up (we may be padding etc), so we can only update
1457      * "pos" here (the actor routine has to update the user buffer
1458      * pointers and the remaining count).
1459      */
1460     ret = actor(desc, page, offset, nr);
1461     offset += ret;
1462     index += offset >> PAGE_CACHE_SHIFT;
1463     offset &= ~PAGE_CACHE_MASK;
1464
1465     page_cache_release(page);
1466     if (ret == nr && desc->count)
1467         continue;
1468     break;
In this block, the page is present in the page cache and ready to be read by the
file read actor function.
1440-1441 As other users could be writing this page, call flush_dcache_page()
to make sure the changes are visible
1447-1448 As the page has just been accessed, call mark_page_accessed() (See Section J.2.3.1) to move it to the active_list
1460 Call the actor function. In this case, the actor function is file_read_actor() (See Section L.3.2.3) which is responsible for copying the bytes from the page to userspace
1461 Update the current offset within the file
1462 Move to the next page if necessary
1463 Update the offset within the page we are currently reading. Remember that we could have just crossed into the next page in the file
1465 Release our reference to this page
1466-1468
If there is still data to be read, loop again to read the next page.
Otherwise break as the read operation is complete
1470 /*
1471  * Ok, the page was not immediately readable, so let's try to read
      * ahead while we're at it..
1472  */
1473 page_not_up_to_date:
1474     generic_file_readahead(reada_ok, filp, inode, page);
1475
1476     if (Page_Uptodate(page))
1477         goto page_ok;
1478
1479     /* Get exclusive access to the page ... */
1480     lock_page(page);
1481
1482     /* Did it get unhashed before we got the lock? */
1483     if (!page->mapping) {
1484         UnlockPage(page);
1485         page_cache_release(page);
1486         continue;
1487     }
1488
1489     /* Did somebody else fill it already? */
1490     if (Page_Uptodate(page)) {
1491         UnlockPage(page);
1492         goto page_ok;
1493     }
In this block, the page being read was not up-to-date with information on the disk. generic_file_readahead() is called to update the current page and readahead as IO is required anyway.
1474 Call generic_file_readahead() (See Section D.6.1.3) to sync the current page and readahead if necessary
1476-1477 If the page is now up-to-date, goto page_ok to start copying the bytes
to userspace
1480 Otherwise something happened with readahead so lock the page for exclusive
access
1483-1487 If the page was somehow removed from the page cache while spinlocks
were not held, then release the reference to the page and start all over again.
The second time around, the page will get allocated and inserted into the page
cache all over again
1490-1493 If someone updated the page while we did not have a lock on the page
then unlock it again and goto
page_ok
to copy the bytes to userspace
1495 readpage:
1496     /* ... and start the actual read. The read will
          * unlock the page. */
1497     error = mapping->a_ops->readpage(filp, page);
1498
1499     if (!error) {
1500         if (Page_Uptodate(page))
1501             goto page_ok;
1502
1503         /* Again, try some read-ahead while waiting for
              * the page to finish.. */
1504         generic_file_readahead(reada_ok, filp, inode, page);
1505         wait_on_page(page);
1506         if (Page_Uptodate(page))
1507             goto page_ok;
1508         error = -EIO;
1509     }
1510
1511     /* UHHUH! A synchronous read error occurred. Report it */
1512     desc->error = error;
1513     page_cache_release(page);
1514     break;
At this block, readahead failed so we synchronously read the page with the address_space supplied readpage() function.
1497 Call the filesystem-specific readpage() function. In many cases this will ultimately call the function block_read_full_page() declared in fs/buffer.c
1499-1501 If no error occurred and the page is now up-to-date, goto page_ok to begin copying the bytes to userspace
1504 Otherwise, schedule some readahead to occur as we are forced to wait on IO anyway
1505-1507 Wait for IO on the requested page to complete. If it finished successfully, then goto page_ok
1508 Otherwise an error occurred so set -EIO to be returned to userspace
1512-1514 An IO error occurred so record it and release the reference to the current page. This error will be picked up from the read_descriptor_t struct by generic_file_read() (See Section D.6.1.1)
1516 no_cached_page:
1517     /*
1518      * Ok, it wasn't cached, so we need to create a new
1519      * page..
1520      *
1521      * We get here with the page cache lock held.
1522      */
1523     if (!cached_page) {
1524         spin_unlock(&pagecache_lock);
1525         cached_page = page_cache_alloc(mapping);
1526         if (!cached_page) {
1527             desc->error = -ENOMEM;
1528             break;
1529         }
1530
1531         /*
1532          * Somebody may have added the page while we
1533          * dropped the page cache lock. Check for that.
1534          */
1535         spin_lock(&pagecache_lock);
1536         page = __find_page_nolock(mapping, index, *hash);
1537         if (page)
1538             goto found_page;
1539     }
1540
1541     /*
1542      * Ok, add the new page to the hash-queues...
1543      */
1544     page = cached_page;
1545     __add_to_page_cache(page, mapping, index, hash);
1546     spin_unlock(&pagecache_lock);
1547     lru_cache_add(page);
1548     cached_page = NULL;
1549
1550     goto readpage;
1551 }
In this block, the page does not exist in the page cache so allocate one and add it.
1523-1539
If a cache page has not already been allocated then allocate one and
make sure that someone else did not insert one into the page cache while we
were sleeping
1524 Release pagecache_lock as page_cache_alloc() may sleep
1525-1529 Allocate a page and set -ENOMEM to be returned if the allocation failed
1535-1536 Acquire pagecache_lock again and search the page cache to make sure
another process has not inserted it while the lock was dropped
1537
If another process added a suitable page to the cache already, jump to
found_page
as the one we just allocated is no longer necessary
1544-1545 Otherwise, add the page we just allocated to the page cache
1547 Add the page to the LRU lists
1548 Set cached_page to NULL as it is now in use
1550 Goto readpage to schedule the page to be read from disk
1552
1553     *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
1554     filp->f_reada = 1;
1555     if (cached_page)
1556         page_cache_release(cached_page);
1557     UPDATE_ATIME(inode);
1558 }
1553 Update our position within the file
1555-1556 If a page was allocated for addition to the page cache and then found to be unneeded, release it here
1557 Update the access time to the file
D.6.1.3 Function: generic_file_readahead() (mm/filemap.c)
This function performs generic file read-ahead. Readahead is one of the few areas that is very heavily commented upon in the code. It is highly recommended that you read the comments in mm/filemap.c marked with Read-ahead context.
1222 static void generic_file_readahead(int reada_ok,
1223     struct file * filp, struct inode * inode,
1224     struct page * page)
1225 {
1226     unsigned long end_index;
1227     unsigned long index = page->index;
1228     unsigned long max_ahead, ahead;
1229     unsigned long raend;
1230     int max_readahead = get_max_readahead(inode);
1231
1232     end_index = inode->i_size >> PAGE_CACHE_SHIFT;
1233
1234     raend = filp->f_raend;
1235     max_ahead = 0;
1227 Get the index to start from based on the supplied page
1230 Get the maximum sized readahead for this block device
1232 Get the index, in pages, of the end of the file
1234 Get the end of the readahead window from the struct file
1236
1237 /*
1238  * The current page is locked.
1239  * If the current position is inside the previous read IO request,
1240  * do not try to reread previously read ahead pages.
1241  * Otherwise decide or not to read ahead some pages synchronously.
1242  * If we are not going to read ahead, set the read ahead context
1243  * for this page only.
1244  */
1245     if (PageLocked(page)) {
1246         if (!filp->f_ralen ||
                 index >= raend ||
                 index + filp->f_rawin < raend) {
1247             raend = index;
1248             if (raend < end_index)
1249                 max_ahead = filp->f_ramax;
1250             filp->f_rawin = 0;
1251             filp->f_ralen = 1;
1252             if (!max_ahead) {
1253                 filp->f_raend = index + filp->f_ralen;
1254                 filp->f_rawin += filp->f_ralen;
1255             }
1256         }
1257     }
This block has encountered a page that is locked so it must decide whether to
temporarily disable readahead.
1245 If the current page is locked for IO, then check if the current page is within
the last readahead window.
If it is, there is no point trying to readahead
again. If it is not, or readahead has not been performed previously, update
the readahead context
1246 The first check is if readahead has been performed previously. The second is to see if the current locked page is after where the previous readahead finished. The third check is if the current locked page is within the current readahead window
1247 Update the end of the readahead window
1248-1249 If the end of the readahead window is not after the end of the file, set max_ahead to be the maximum amount of readahead that should be used with this struct file (filp→f_ramax)
1250-1255 Set readahead to only occur with the current page, effectively disabling readahead
1258 /*
1259  * The current page is not locked.
1260  * If we were reading ahead and,
1261  * if the current max read ahead size is not zero and,
1262  * if the current position is inside the last read-ahead IO
1263  * request, it is the moment to try to read ahead asynchronously.
1264  * We will later force unplug device in order to force
      * asynchronous read IO.
1265  */
1266     else if (reada_ok && filp->f_ramax && raend >= 1 &&
1267              index <= raend && index + filp->f_ralen >= raend) {
1268 /*
1269  * Add ONE page to max_ahead in order to try to have about the
1270  * same IO maxsize as synchronous read-ahead
      * (MAX_READAHEAD + 1)*PAGE_CACHE_SIZE.
1271  * Compute the position of the last page we have tried to read
1272  * in order to begin to read ahead just at the next page.
1273  */
1274         raend -= 1;
1275         if (raend < end_index)
1276             max_ahead = filp->f_ramax + 1;
1277
1278         if (max_ahead) {
1279             filp->f_rawin = filp->f_ralen;
1280             filp->f_ralen = 0;
1281             reada_ok = 2;
1282         }
1283     }
This is one of the rare cases where the in-code commentary makes the code as clear as it possibly could be. Basically, it is saying that if the current page is not locked for IO, then extend the readahead window slightly and remember that readahead is currently going well.
1284 /*
1285  * Try to read ahead pages.
1286  * We hope that ll_rw_blk() plug/unplug, coalescence, requests
1287  * sort and the scheduler, will work enough for us to avoid too
      * bad actuals IO requests.
1288  */
1289     ahead = 0;
1290     while (ahead < max_ahead) {
1291         ahead ++;
1292         if ((raend + ahead) >= end_index)
1293             break;
1294         if (page_cache_read(filp, raend + ahead) < 0)
1295             break;
1296     }
This block performs the actual readahead by calling page_cache_read() for each of the pages in the readahead window. Note here how ahead is incremented for each page that is read ahead.
1297 /*
1298  * If we tried to read ahead some pages,
1299  * If we tried to read ahead asynchronously,
1300  * Try to force unplug of the device in order to start an
1301  * asynchronous read IO request.
1302  * Update the read-ahead context.
1303  * Store the length of the current read-ahead window.
1304  * Double the current max read ahead size.
1305  * That heuristic avoid to do some large IO for files that are
1306  * not really accessed sequentially.
1307  */
1308     if (ahead) {
1309         filp->f_ralen += ahead;
1310         filp->f_rawin += filp->f_ralen;
1311         filp->f_raend = raend + ahead + 1;
1312
1313         filp->f_ramax += filp->f_ramax;
1314
1315         if (filp->f_ramax > max_readahead)
1316             filp->f_ramax = max_readahead;
1317
1318 #ifdef PROFILE_READAHEAD
1319         profile_readahead((reada_ok == 2), filp);
1320 #endif
1321     }
1322
1323     return;
1324 }
If readahead was successful, then update the readahead fields in the struct file to mark the progress. This is basically growing the readahead context but can be reset by do_generic_file_readahead() if it is found that the readahead is ineffective.
1309 Update the f_ralen with the number of pages that were read ahead in this pass
1310 Update the size of the readahead window
1311 Mark the end of the readahead
1313 Double the current maximum-sized readahead
1315-1316 Do not let the maximum sized readahead get larger than the maximum readahead defined for this block device
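The doubling-with-a-cap behaviour is easy to see in isolation. The stand-alone sketch below mimics lines 1313-1316 across several successful passes; the starting window of 4 pages and the cap of 31 pages are just example values standing in for f_ramax and for the per-device limit returned by get_max_readahead().

    #include <stdio.h>

    int main(void)
    {
        unsigned long f_ramax = 4, max_readahead = 31;
        int pass;

        for (pass = 1; pass <= 5; pass++) {
            f_ramax += f_ramax;             /* double the window     */
            if (f_ramax > max_readahead)    /* but respect the limit */
                f_ramax = max_readahead;
            printf("pass %d: f_ramax = %lu pages\n", pass, f_ramax);
        }
        return 0;
    }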
D.6.2 Generic File mmap()
D.6.2.1 Function: generic_file_mmap() (mm/filemap.c)
This is the generic mmap() function used by many struct files as their struct file_operations. It is mainly responsible for ensuring the appropriate address_space functions exist and setting what VMA operations to use.
2249 int generic_file_mmap(struct file * file,
            struct vm_area_struct * vma)
2250 {
2251     struct address_space *mapping =
                file->f_dentry->d_inode->i_mapping;
2252     struct inode *inode = mapping->host;
2253
2254     if ((vma->vm_flags & VM_SHARED) &&
             (vma->vm_flags & VM_MAYWRITE)) {
2255         if (!mapping->a_ops->writepage)
2256             return -EINVAL;
2257     }
2258     if (!mapping->a_ops->readpage)
2259         return -ENOEXEC;
2260     UPDATE_ATIME(inode);
2261     vma->vm_ops = &generic_file_vm_ops;
2262     return 0;
2263 }
2251 Get the address_space that is managing the file being mapped
2252 Get the struct inode for this address_space
2254-2257 If the VMA is to be shared and writable, make sure an a_ops→writepage()
function exists. Return
-EINVAL
if it does not
2258-2259 Make sure a a_ops→readpage() function exists
2260 Update the access time for the inode
2261 Use generic_file_vm_ops for the file operations. The generic VM operations structure, defined in mm/filemap.c, only supplies filemap_nopage() (See Section D.6.4.1) as its nopage() function. No other callback is defined
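For reference, the structure being assigned here is declared along the following lines in mm/filemap.c (shown as a sketch rather than a verbatim excerpt from the source):

    static struct vm_operations_struct generic_file_vm_ops = {
        nopage:     filemap_nopage,     /* Section D.6.4.1 */
    };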
D.6.3 Generic File Truncation
This section covers the path where a file is being truncated. The actual system call truncate() is implemented by sys_truncate() in fs/open.c. By the time the top-level function in the VM is called (vmtruncate()), the dentry information for the file has been updated and the inode's semaphore has been acquired.
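From user space, the path is entered with the truncate() or ftruncate() system calls. The short stand-alone program below (the file name and sizes are arbitrary) shrinks a file and so, via the route described above, ends up in vmtruncate().

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("example.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Shrinking the file unmaps and frees page cache pages past the
         * new end of file */
        if (ftruncate(fd, 1024) < 0)
            perror("ftruncate");
        close(fd);
        return 0;
    }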
D.6.3.1 Function: vmtruncate() (mm/memory.c)
This is the top-level VM function responsible for truncating a file. When it completes, all page table entries mapping pages that have been truncated have been unmapped and reclaimed if possible.
1042 int vmtruncate(struct inode * inode, loff_t offset)
1043 {
1044     unsigned long pgoff;
1045     struct address_space *mapping = inode->i_mapping;
1046     unsigned long limit;
1047
1048     if (inode->i_size < offset)
1049         goto do_expand;
1050     inode->i_size = offset;
1051     spin_lock(&mapping->i_shared_lock);
1052     if (!mapping->i_mmap && !mapping->i_mmap_shared)
1053         goto out_unlock;
1054
1055     pgoff = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1056     if (mapping->i_mmap != NULL)
1057         vmtruncate_list(mapping->i_mmap, pgoff);
1058     if (mapping->i_mmap_shared != NULL)
1059         vmtruncate_list(mapping->i_mmap_shared, pgoff);
1060
1061 out_unlock:
1062     spin_unlock(&mapping->i_shared_lock);
1063     truncate_inode_pages(mapping, offset);
1064     goto out_truncate;
1065
1066 do_expand:
1067     limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
1068     if (limit != RLIM_INFINITY && offset > limit)
1069         goto out_sig;
1070     if (offset > inode->i_sb->s_maxbytes)
1071         goto out;
1072     inode->i_size = offset;
1073
1074 out_truncate:
1075     if (inode->i_op && inode->i_op->truncate) {
1076         lock_kernel();
1077         inode->i_op->truncate(inode);
1078         unlock_kernel();
1079     }
1080     return 0;
1081 out_sig:
1082     send_sig(SIGXFSZ, current, 0);
1083 out:
1084     return -EFBIG;
1085 }
1042 The parameters passed are the inode being truncated and the new offset marking the new end of the file. The old length of the file is stored in inode→i_size
1045 Get the address_space responsible for the inode
1048-1049 If the new file size is larger than the old size, then goto do_expand where the ulimits for the process will be checked before the file is grown
1050 Here, the file is being shrunk so update inode→i_size to match
1051 Lock the spinlock protecting the two lists of VMAs using this inode
1052-1053 If no VMAs are mapping the inode, goto out_unlock where the pages used by the file will be reclaimed by truncate_inode_pages() (See Section D.6.3.6)
1055 Calculate pgoff as the offset within the file in pages where the truncation will begin
1056-1057
Truncate pages from all private mappings with
vmtruncate_list()
(See Section D.6.3.2)
1058-1059 Truncate pages from all shared mappings
1062 Unlock the spinlock protecting the VMA lists
1063 Call truncate_inode_pages() (See Section D.6.3.6) to reclaim the pages if
they exist in the page cache for the file
1064 Goto out_truncate to call the filesystem-specific truncate() function so the blocks used on disk will be freed
1066-1071 If the file is being expanded, make sure that the process limits for maximum file size are not being exceeded and the hosting filesystem is able to support the new filesize
1072 If the limits are fine, then update the inode's size and fall through to call the filesystem-specific truncate function which will fill the expanded filesize with zeros
1075-1079 If the filesystem provides a truncate() function, then lock the kernel, call it and unlock the kernel again. Filesystems do not acquire the proper locks to prevent races between file truncation and file expansion due to writing or faulting so the big kernel lock is needed
1080 Return success
1082-1084 If the file size would grow to being too big, send the SIGXFSZ signal to the calling process and return -EFBIG
D.6.3.2 Function: vmtruncate_list() (mm/memory.c)
This function cycles through all VMAs in an address_space's list and calls zap_page_range() for the range of addresses which map a file that is being truncated.
1006 static void vmtruncate_list(struct vm_area_struct *mpnt,
            unsigned long pgoff)
1007 {
1008     do {
1009         struct mm_struct *mm = mpnt->vm_mm;
1010         unsigned long start = mpnt->vm_start;
1011         unsigned long end = mpnt->vm_end;
1012         unsigned long len = end - start;
1013         unsigned long diff;
1014
1015         /* mapping wholly truncated? */
1016         if (mpnt->vm_pgoff >= pgoff) {
1017             zap_page_range(mm, start, len);
1018             continue;
1019         }
1020
1021         /* mapping wholly unaffected? */
1022         len = len >> PAGE_SHIFT;
1023         diff = pgoff - mpnt->vm_pgoff;
1024         if (diff >= len)
1025             continue;
1026
1027         /* Ok, partially affected.. */
1028         start += diff << PAGE_SHIFT;
1029         len = (len - diff) << PAGE_SHIFT;
1030         zap_page_range(mm, start, len);
1031     } while ((mpnt = mpnt->vm_next_share) != NULL);
1032 }
1008-1031 Loop through all VMAs in the list
1009 Get the mm_struct that hosts this VMA
1010-1012 Calculate the start, end and length of the VMA
1016-1019 If the whole VMA is being truncated, call the function zap_page_range()
(See Section D.6.3.3) with the start and length of the full VMA
1022 Calculate the length of the VMA in pages
1023-1025 Check if the VMA maps any of the region being truncated. If the VMA is unaffected, continue to the next VMA
1028-1029 Else the VMA is being partially truncated so calculate where the start
and length of the region to truncate is in pages
1030 Call zap_page_range() (See Section D.6.3.3) to unmap the affected region
D.6.3.3 Function: zap_page_range() (mm/memory.c)
This function is the top-level pagetable-walk function which unmaps userpages in the specified range from a mm_struct.
360 void zap_page_range(struct mm_struct *mm,
            unsigned long address, unsigned long size)
361 {
362     mmu_gather_t *tlb;
363     pgd_t * dir;
364     unsigned long start = address, end = address + size;
365     int freed = 0;
366
367     dir = pgd_offset(mm, address);
368
369     /*
370      * This is a long-lived spinlock. That's fine.
371      * There's no contention, because the page table
372      * lock only protects against kswapd anyway, and
373      * even if kswapd happened to be looking at this
374      * process we _want_ it to get stuck.
375      */
376     if (address >= end)
377         BUG();
378     spin_lock(&mm->page_table_lock);
379     flush_cache_range(mm, address, end);
380     tlb = tlb_gather_mmu(mm);
381
382     do {
383         freed += zap_pmd_range(tlb, dir, address, end - address);
384         address = (address + PGDIR_SIZE) & PGDIR_MASK;
385         dir++;
386     } while (address && (address < end));
387
388     /* this will flush any remaining tlb entries */
389     tlb_finish_mmu(tlb, start, end);
390
391     /*
392      * Update rss for the mm_struct (not necessarily current->mm)
393      * Notice that rss is an unsigned long.
394      */
395     if (mm->rss > freed)
396         mm->rss -= freed;
397     else
398         mm->rss = 0;
399     spin_unlock(&mm->page_table_lock);
400 }
364 Calculate the start and end address for zapping
367 Calculate the PGD (dir) that contains the starting address
376-377 Make sure the start address is not after the end address
378 Acquire the spinlock protecting the page tables.
This is a very long-held lock
and would normally be considered a bad idea but the comment above the block
explains why it is ok in this case
379 Flush the CPU cache for this range
380 tlb_gather_mmu() records the MM that is being altered. Later, tlb_remove_page() will be called to unmap the PTE which stores the PTEs in a struct free_pte_ctx until the zapping is finished. This is to avoid having to constantly flush the TLB as PTEs are freed
382-386 For each PMD affected by the zapping, call zap_pmd_range() until the end address has been reached. Note that tlb is passed as well for tlb_remove_page() to use later
389 tlb_finish_mmu() frees all the PTEs that were unmapped by tlb_remove_page() and then flushes the TLBs. Doing the flushing this way avoids a storm of TLB flushing that would be otherwise required for each PTE unmapped
395-398 Update RSS count
399 Release the pagetable lock
D.6.3.4 Function: zap_pmd_range() (mm/memory.c)
This function is unremarkable. It steps through the PMDs that are affected by the requested range and calls zap_pte_range() for each one.
331 static inline int zap_pmd_range(mmu_gather_t *tlb, pgd_t * dir,
            unsigned long address,
            unsigned long size)
332 {
333     pmd_t * pmd;
334     unsigned long end;
335     int freed;
336
337     if (pgd_none(*dir))
338         return 0;
339     if (pgd_bad(*dir)) {
340         pgd_ERROR(*dir);
341         pgd_clear(dir);
342         return 0;
343     }
344     pmd = pmd_offset(dir, address);
345     end = address + size;
346     if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
347         end = ((address + PGDIR_SIZE) & PGDIR_MASK);
348     freed = 0;
349     do {
350         freed += zap_pte_range(tlb, pmd, address, end - address);
351         address = (address + PMD_SIZE) & PMD_MASK;
352         pmd++;
353     } while (address < end);
354     return freed;
355 }
337-338 If no PGD exists, return
339-343 If the PGD is bad, flag the error and return
344 Get the starting pmd
345-347 Calculate the end address of the zapping. If it is beyond the end of this PGD, then set end to the end of the PGD
349-353 Step through all PMDs in this PGD. For each PMD, call zap_pte_range()
(See Section D.6.3.5) to unmap the PTEs
354 Return how many pages were freed
D.6.3.5 Function: zap_pte_range() (mm/memory.c)
This function calls tlb_remove_page() for each PTE in the requested PMD within the requested address range.
294 static inline int zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd,
            unsigned long address,
            unsigned long size)
295 {
296     unsigned long offset;
297     pte_t * ptep;
298     int freed = 0;
299
300     if (pmd_none(*pmd))
301         return 0;
302     if (pmd_bad(*pmd)) {
303         pmd_ERROR(*pmd);
304         pmd_clear(pmd);
305         return 0;
306     }
307     ptep = pte_offset(pmd, address);
308     offset = address & ~PMD_MASK;
309     if (offset + size > PMD_SIZE)
310         size = PMD_SIZE - offset;
311     size &= PAGE_MASK;
312     for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
313         pte_t pte = *ptep;
314         if (pte_none(pte))
315             continue;
316         if (pte_present(pte)) {
317             struct page *page = pte_page(pte);
318             if (VALID_PAGE(page) && !PageReserved(page))
319                 freed ++;
320             /* This will eventually call __free_pte on the pte. */
321             tlb_remove_page(tlb, ptep, address + offset);
322         } else {
323             free_swap_and_cache(pte_to_swp_entry(pte));
324             pte_clear(ptep);
325         }
326     }
327
328     return freed;
329 }
300-301 If the PMD does not exist, return
302-306 If the PMD is bad, flag the error and return
307 Get the starting PTE offset
308 Align the offset to a PMD boundary
309 If the size of the region to unmap is past the PMD boundary, fix the size so that only this PMD will be affected
311 Align size to a page boundary
312-326 Step through all PTEs in the region
314-315 If no PTE exists, continue to the next one
316-322 If the PTE is present, then call tlb_remove_page() to unmap the page. If the page is reclaimable, increment the freed count
322-325 If the PTE is in use but the page is paged out or in the swap cache, then free the swap slot and page with free_swap_and_cache() (See Section K.3.2.3). It is possible that a page is reclaimed if it was in the swap cache that is unaccounted for here but it is not of paramount importance
328 Return the number of pages that were freed
D.6.3.6 Function: truncate_inode_pages() (mm/filemap.c)
This is the top-level function responsible for truncating all pages from the page cache that occur after lstart in a mapping.
327 void truncate_inode_pages(struct address_space * mapping,
            loff_t lstart)
328 {
329     unsigned long start = (lstart + PAGE_CACHE_SIZE - 1) >>
                                  PAGE_CACHE_SHIFT;
330     unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
331     int unlocked;
332
333     spin_lock(&pagecache_lock);
334     do {
335         unlocked = truncate_list_pages(&mapping->clean_pages,
                           start, &partial);
336         unlocked |= truncate_list_pages(&mapping->dirty_pages,
                           start, &partial);
337         unlocked |= truncate_list_pages(&mapping->locked_pages,
                           start, &partial);
338     } while (unlocked);
339     /* Traversed all three lists without dropping the lock */
340     spin_unlock(&pagecache_lock);
341 }
329 Calculate where to start the truncation as an index in pages
330 Calculate partial as an offset within the last page if it is being partially truncated
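A quick worked example of these two calculations, again assuming 4KiB pages:

    /* Worked example only; 4096 stands in for PAGE_CACHE_SIZE */
    loff_t lstart = 10000;
    unsigned long start = (lstart + 4096 - 1) >> 12;    /* = 3    */
    unsigned partial    = lstart & (4096 - 1);          /* = 1808 */
    /* Pages with index 3 and above are dropped entirely; page 2 is kept
     * but zeroed from byte 1808 to the end of the page */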
333 Lock the page cache
334 This will loop until none of the calls to truncate_list_pages() return that
a page was found that should have been reclaimed
335 Use truncate_list_pages() (See Section D.6.3.7) to truncate all pages in the clean_pages list
336 Similarly, truncate pages in the dirty_pages list
337 Similarly, truncate pages in the locked_pages list
340 Unlock the page cache
D.6.3.7 Function: truncate_list_pages() (mm/filemap.c)
This function searches the requested list (head) which is part of an address_space. If pages are found after start, they will be truncated.
259 static int truncate_list_pages(struct list_head *head,
            unsigned long start,
            unsigned *partial)
260 {
261     struct list_head *curr;
262     struct page * page;
263     int unlocked = 0;
264
265 restart:
266     curr = head->prev;
267     while (curr != head) {
268         unsigned long offset;
269
270         page = list_entry(curr, struct page, list);
271         offset = page->index;
272
273         /* Is one of the pages to truncate? */
274         if ((offset >= start) ||
                (*partial && (offset + 1) == start)) {
275             int failed;
276
277             page_cache_get(page);
278             failed = TryLockPage(page);
279
280             list_del(head);
281             if (!failed)
282                 /* Restart after this page */
283                 list_add_tail(head, curr);
284             else
285                 /* Restart on this page */
286                 list_add(head, curr);
287
288             spin_unlock(&pagecache_lock);
289             unlocked = 1;
290
291             if (!failed) {
292                 if (*partial && (offset + 1) == start) {
293                     truncate_partial_page(page, *partial);
294                     *partial = 0;
295                 } else
296                     truncate_complete_page(page);
297
298                 UnlockPage(page);
299             } else
300                 wait_on_page(page);
301
302             page_cache_release(page);
303
304             if (current->need_resched) {
305                 __set_current_state(TASK_RUNNING);
306                 schedule();
307             }
308
309             spin_lock(&pagecache_lock);
310             goto restart;
311         }
312         curr = curr->prev;
313     }
314     return unlocked;
315 }
266-267 Record the start of the list and loop until the full list has been scanned
270-271 Get the page for this entry and what offset within the file it represents
274 If the current page is after start or is a page that is to be partially truncated,
then truncate this page, else move to the next one
277-278 Take a reference to the page and try to lock it
280 Remove the page from the list
281-283 If we locked the page, add it back to the list where it will be skipped over
on the next iteration of the loop
284-286
Else add it back where it will be found again immediately. Later in the
function,
wait_on_page()
is called until the page is unlocked
288 Release the pagecache lock
289 Set unlocked to 1 to indicate a page was found that had to be truncated. This will force truncate_inode_pages() to call this function again to make sure there are no pages left behind. This looks like an oversight and was intended to have the functions recalled only if a locked page was found but the way it is implemented means that it will be called whether the page was locked or not
291-299 If we locked the page, then truncate it
292-294 If the page is to be partially truncated, call truncate_partial_page() (See Section D.6.3.10) with the offset within the page where the truncation begins (partial)
296
Else call
truncate_complete_page()
(See Section D.6.3.8) to truncate the
whole page
298 Unlock the page
300 If the page locking failed, call wait_on_page() to wait until the page can be
locked
302 Release the reference to the page.
If there are no more mappings for the page,
it will be reclaimed
304-307
Check if the process should call
schedule()
before continuing. This is
to prevent a truncating process from hogging the CPU
309 Reacquire the spinlock and restart the scanning for pages to reclaim
312 The current page should not be reclaimed so move to the next page
314 Return 1 if a page was found in the list that had to be truncated
D.6.3.8 Function: truncate_complete_page() (mm/filemap.c)
239 static void truncate_complete_page(struct page *page)
240 {
241     /* Leave it on the LRU if it gets converted into
         * anonymous buffers */
242     if (!page->buffers || do_flushpage(page, 0))
243         lru_cache_del(page);
244
245     /*
246      * We remove the page from the page cache _after_ we have
247      * destroyed all buffer-cache references to it. Otherwise some
248      * other process might think this inode page is not in the
249      * page cache and creates a buffer-cache alias to it causing
250      * all sorts of fun problems ...
251      */
252     ClearPageDirty(page);
253     ClearPageUptodate(page);
254     remove_inode_page(page);
255     page_cache_release(page);
256 }
242 If the page has buffers, call do_flushpage() (See Section D.6.3.9) to flush all buffers associated with the page. The comments in the following lines describe the problem concisely
243 Delete the page from the LRU
252-253 Clear the dirty and uptodate ags for the page
254
Call
remove_inode_page()
(See Section J.1.2.1) to delete the page from the
page cache
255 Drop the reference to the page. The page will be later reclaimed when truncate_list_pages() drops its own private reference to it
D.6.3.9 Function: do_flushpage() (mm/filemap.c)
This function is responsible for flushing all buffers associated with a page.
223 static int do_flushpage(struct page *page, unsigned long offset)
224 {
225     int (*flushpage) (struct page *, unsigned long);
226     flushpage = page->mapping->a_ops->flushpage;
227     if (flushpage)
228         return (*flushpage)(page, offset);
229     return block_flushpage(page, offset);
230 }
226-228 If the page→mapping provides a flushpage() function, call it
229 Else call block_flushpage() which is the generic function for flushing buffers associated with a page
D.6.3.10 Function: truncate_partial_page() (mm/filemap.c)
This function partially truncates a page by zeroing out the higher bytes no longer in use and flushing any associated buffers.

232 static inline void truncate_partial_page(struct page *page,
                                             unsigned partial)
233 {
234     memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial);
235     if (page->buffers)
236         do_flushpage(page, partial);
237 }
234 memclear_highpage_flush() fills an address range with zeros. In this case, it will zero from partial to the end of the page
235-236 If the page has any associated buffers, flush any buffers containing data in the truncated region
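To make the zeroing concrete, the following standalone sketch mimics what the call at line 234 achieves for a lowmem page. It is not the kernel implementation: it assumes a 4096-byte PAGE_CACHE_SIZE and uses plain memset() instead of memclear_highpage_flush(), which additionally handles kmap() and cache flushing for highmem pages.

#include <stdio.h>
#include <string.h>

#define PAGE_CACHE_SIZE 4096

/* Zero everything from `partial` to the end of a page-sized buffer.
 * Data before `partial` is left untouched. */
static void zero_truncated_tail(char *page, unsigned partial)
{
        memset(page + partial, 0, PAGE_CACHE_SIZE - partial);
}

int main(void)
{
        static char page[PAGE_CACHE_SIZE];

        memset(page, 'x', sizeof(page));
        zero_truncated_tail(page, 100);   /* new file length ends 100 bytes in */
        printf("byte 99 = '%c', byte 100 = %d\n", page[99], page[100]);
        return 0;
}

Running it prints byte 99 unchanged and byte 100 as 0, which is exactly the boundary the partial truncation needs to preserve.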
D.6.4 Reading Pages for the Page Cache

D.6.4.1 Function: filemap_nopage() (mm/filemap.c)
This is the generic nopage() function used by many VMAs. It loops around itself with a large number of gotos which can be difficult to trace but there is nothing novel here. It is principally responsible for fetching the faulting page from either the page cache or reading it from disk. If appropriate, it will also perform file read-ahead.
1994 struct page * filemap_nopage(struct vm_area_struct * area,
                                  unsigned long address,
                                  int unused)
1995 {
1996     int error;
1997     struct file *file = area->vm_file;
1998     struct address_space *mapping =
                 file->f_dentry->d_inode->i_mapping;
1999     struct inode *inode = mapping->host;
2000     struct page *page, **hash;
2001     unsigned long size, pgoff, endoff;
2002
2003     pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) +
                 area->vm_pgoff;
2004     endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) +
                  area->vm_pgoff;
2005
This block acquires the struct file, address_space and inode important for this page fault. It then acquires the starting offset within the file needed for this fault and the offset that corresponds to the end of this VMA. The offset is the end of the VMA instead of the end of the page in case file read-ahead is performed.

1997-1999 Acquire the struct file, address_space and inode required for this fault
2003 Calculate pgoff which is the offset within the file corresponding to the beginning of the fault
2004 Calculate the offset within the file corresponding to the end of the VMA
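The following standalone sketch works through the arithmetic at lines 2003-2004 for an invented VMA. It assumes a 4KiB page cache size, so PAGE_CACHE_SHIFT is 12; the addresses and vm_pgoff are illustrative values only.

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12    /* assume 4KiB page cache pages */

int main(void)
{
        unsigned long vm_start = 0x40010000UL;   /* hypothetical VMA start */
        unsigned long vm_end   = 0x40030000UL;   /* hypothetical VMA end */
        unsigned long vm_pgoff = 8;              /* mapping begins 8 pages into the file */
        unsigned long address  = 0x40015000UL;   /* faulting address */

        unsigned long pgoff  = ((address - vm_start) >> PAGE_CACHE_SHIFT) + vm_pgoff;
        unsigned long endoff = ((vm_end - vm_start) >> PAGE_CACHE_SHIFT) + vm_pgoff;

        /* 5 pages into the mapping plus 8 gives file page 13;
         * the VMA covers file pages up to (but not including) 40 */
        printf("pgoff = %lu, endoff = %lu\n", pgoff, endoff);
        return 0;
}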
2006 retry_all:
2007     /*
2008      * An external ptracer can access pages that normally aren't
2009      * accessible..
2010      */
2011     size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
2012     if ((pgoff >= size) && (area->vm_mm == current->mm))
2013         return NULL;
2014
2015     /* The "size" of the file, as far as mmap is concerned, isn't
            bigger than the mapping */
2016     if (size > endoff)
2017         size = endoff;
2018
2019     /*
2020      * Do we have something in the page cache already?
2021      */
2022     hash = page_hash(mapping, pgoff);
2023 retry_find:
2024     page = __find_get_page(mapping, pgoff, hash);
2025     if (!page)
2026         goto no_cached_page;
2027
2028     /*
2029      * Ok, found a page in the page cache, now we need to check
2030      * that it's up-to-date.
2031      */
2032     if (!Page_Uptodate(page))
2033         goto page_not_uptodate;
2011 Calculate the size of the file in pages
2012 If the faulting pgoff is beyond the end of the file and this is not a tracing process, return NULL
2016-2017 If the VMA maps beyond the end of the file, then set the size of the file to be the end of the mapping
2022-2024 Search for the page in the page cache
2025-2026 If it does not exist, goto no_cached_page where page_cache_read() will be called to read the page from backing storage
2032-2033 If the page is not up-to-date, goto page_not_uptodate where the page will either be declared invalid or else the data in the page updated
2035 success:
2036     /*
2037      * Try read-ahead for sequential areas.
2038      */
2039     if (VM_SequentialReadHint(area))
2040         nopage_sequential_readahead(area, pgoff, size);
2041
2042     /*
2043      * Found the page and have a reference on it, need to check sharing
2044      * and possibly copy it over to another page..
2045      */
2046     mark_page_accessed(page);
2047     flush_page_to_ram(page);
2048     return page;
2049
2039-2040 If this mapping specified the VM_SEQ_READ hint, then the pages around the current fault will be pre-faulted with nopage_sequential_readahead()
2046 Mark the faulted-in page as accessed so it will be moved to the active_list
2047 As the page is about to be installed into a process page table, call flush_page_to_ram() so that recent stores by the kernel to the page will definitely be visible to userspace
2048 Return the faulted-in page
2050 no_cached_page:
2051     /*
2052      * If the requested offset is within our file, try to read
2053      * a whole cluster of pages at once.
2054      *
2055      * Otherwise, we're off the end of a privately mapped file,
2056      * so we need to map a zero page.
2057      */
2058     if ((pgoff < size) && !VM_RandomReadHint(area))
2059         error = read_cluster_nonblocking(file, pgoff, size);
2060     else
2061         error = page_cache_read(file, pgoff);
2062
2063     /*
2064      * The page we want has now been added to the page cache.
2065      * In the unlikely event that someone removed it in the
2066      * meantime, we'll just come back here and read it again.
2067      */
2068     if (error >= 0)
2069         goto retry_find;
2070
2071     /*
2072      * An error return from page_cache_read can result if the
2073      * system is low on memory, or a problem occurs while trying
2074      * to schedule I/O.
2075      */
2076     if (error == -ENOMEM)
2077         return NOPAGE_OOM;
2078     return NULL;
2058-2059 If the end of the file has not been reached and the random-read hint has not been specified, call read_cluster_nonblocking() to pre-fault in just a few pages near the faulting page
2061 Else, the file is being accessed randomly, so just call page_cache_read() (See Section D.6.4.2) to read in just the faulting page
2068-2069 If no error occurred, goto retry_find at line 2023 which will check to make sure the page is in the page cache before returning
2076-2077 If the error was due to being out of memory, return that so the fault handler can act accordingly
2078 Else return NULL to indicate that a non-existent page was faulted, resulting in a SIGBUS signal being sent to the faulting process
2080 page_not_uptodate:
2081     lock_page(page);
2082
2083     /* Did it get unhashed while we waited for it? */
2084     if (!page->mapping) {
2085         UnlockPage(page);
2086         page_cache_release(page);
2087         goto retry_all;
2088     }
2089
2090     /* Did somebody else get it up-to-date? */
2091     if (Page_Uptodate(page)) {
2092         UnlockPage(page);
2093         goto success;
2094     }
2095
2096     if (!mapping->a_ops->readpage(file, page)) {
2097         wait_on_page(page);
2098         if (Page_Uptodate(page))
2099             goto success;
2100     }
In this block, the page was found but it was not up-to-date so the reasons for the page not being up-to-date are checked. If it looks ok, the appropriate readpage() function is called to resync the page.

2081 Lock the page for IO
2084-2088 If the page was removed from the mapping (possible because of a file truncation) and is now anonymous, then goto retry_all which will try and fault in the page again
2090-2094 Check the Uptodate flag again in case the page was updated just before we locked the page for IO
2096 Call the address_space→readpage() function to schedule the data to be read from disk
2097 Wait for the IO to complete and if it is now up-to-date, goto success to return the page. If the readpage() function failed, fall through to the error recovery path
2101     /*
2102      * Umm, take care of errors if the page isn't up-to-date.
2103      * Try to re-read it _once_. We do this synchronously,
2104      * because there really aren't any performance issues here
2105      * and we need to check for errors.
2106      */
2107     lock_page(page);
2108
2109     /* Somebody truncated the page on us? */
2110     if (!page->mapping) {
2111         UnlockPage(page);
2112         page_cache_release(page);
2113         goto retry_all;
2114     }
2115
2116     /* Somebody else successfully read it in? */
2117     if (Page_Uptodate(page)) {
2118         UnlockPage(page);
2119         goto success;
2120     }
2121
2122     ClearPageError(page);
2123     if (!mapping->a_ops->readpage(file, page)) {
2124         wait_on_page(page);
2125         if (Page_Uptodate(page))
2126             goto success;
2127     }
2128
2129     /*
2130      * Things didn't work out. Return zero to tell the
2131      * mm layer so, possibly freeing the page cache page first.
2132      */
2133     page_cache_release(page);
2134     return NULL;
2135 }
In this path, the page is not up-to-date due to some IO error. A second attempt is made to read the page data and if it fails, return.

2110-2127 This is almost identical to the previous block. The only difference is that ClearPageError() is called to clear the error caused by the previous IO
2133 If it still failed, release the reference to the page because it is useless
2134 Return NULL because the fault failed
D.6.4.2 Function: page_cache_read() (mm/filemap.c)
This function adds the page corresponding to the offset within the file to the page cache if it does not exist there already.
702 static int page_cache_read(struct file * file,
                               unsigned long offset)
703 {
704     struct address_space *mapping =
                file->f_dentry->d_inode->i_mapping;
705     struct page **hash = page_hash(mapping, offset);
706     struct page *page;
707
708     spin_lock(&pagecache_lock);
709     page = __find_page_nolock(mapping, offset, *hash);
710     spin_unlock(&pagecache_lock);
711     if (page)
712         return 0;
713
714     page = page_cache_alloc(mapping);
715     if (!page)
716         return -ENOMEM;
717
718     if (!add_to_page_cache_unique(page, mapping, offset, hash)) {
719         int error = mapping->a_ops->readpage(file, page);
720         page_cache_release(page);
721         return error;
722     }
723     /*
724      * We arrive here in the unlikely event that someone
725      * raced with us and added our page to the cache first.
726      */
727     page_cache_release(page);
728     return 0;
729 }
704 Acquire the address_space mapping managing the file
705 The page cache is a hash table and page_hash() returns the first page in the bucket for this mapping and offset
708-709 Search the page cache with __find_page_nolock() (See Section J.1.4.3). This basically will traverse the list starting at hash to see if the requested page can be found
711-712 If the page is already in the page cache, return
714 Allocate a new page for insertion into the page cache. page_cache_alloc() will allocate a page from the buddy allocator using GFP mask information contained in mapping
718 Insert the page into the page cache with add_to_page_cache_unique() (See Section J.1.1.2). This function is used because a second check needs to be made to make sure the page was not inserted into the page cache while the pagecache_lock spinlock was not acquired
719 If the allocated page was inserted into the page cache, it needs to be populated with data so the readpage() function for the mapping is called. This schedules the IO to take place and the page will be unlocked when the IO completes
720 The path in add_to_page_cache_unique() (See Section J.1.1.2) takes an extra reference to the page being added to the page cache which is dropped here. The page will not be freed
727 If another process added the page to the page cache, it is released here by page_cache_release() as there will be no users of the page
D.6.5 File Readahead for nopage()

D.6.5.1 Function: nopage_sequential_readahead() (mm/filemap.c)
This function is only called by filemap_nopage() when the VM_SEQ_READ flag has been specified in the VMA. When half of the current readahead window has been faulted in, the next readahead window is scheduled for IO and pages from the previous window are freed.

1936 static void nopage_sequential_readahead(
                                struct vm_area_struct * vma,
1937                            unsigned long pgoff, unsigned long filesize)
1938 {
1939     unsigned long ra_window;
1940
1941     ra_window = get_max_readahead(vma->vm_file->f_dentry->d_inode);
1942     ra_window = CLUSTER_OFFSET(ra_window + CLUSTER_PAGES - 1);
1943
1944     /* vm_raend is zero if we haven't read ahead
1945      * in this area yet. */
1946     if (vma->vm_raend == 0)
1947         vma->vm_raend = vma->vm_pgoff + ra_window;
1941 get_max_readahead() returns the maximum sized readahead window for the block device the specified inode resides on
1942 CLUSTER_PAGES is the number of pages that are paged-in or paged-out in bulk. The macro CLUSTER_OFFSET() will align the readahead window to a cluster boundary
1946-1947 If read-ahead has not occurred yet, set the end of the read-ahead window (vm_raend)
1948     /*
1949      * If we've just faulted the page half-way through our window,
1950      * then schedule reads for the next window, and release the
1951      * pages in the previous window.
1952      */
1953     if ((pgoff + (ra_window >> 1)) == vma->vm_raend) {
1954         unsigned long start = vma->vm_pgoff + vma->vm_raend;
1955         unsigned long end = start + ra_window;
1956
1957         if (end > ((vma->vm_end >> PAGE_SHIFT) + vma->vm_pgoff))
1958             end = (vma->vm_end >> PAGE_SHIFT) + vma->vm_pgoff;
1959         if (start > end)
1960             return;
1961
1962         while ((start < end) && (start < filesize)) {
1963             if (read_cluster_nonblocking(vma->vm_file,
1964                                          start, filesize) < 0)
1965                 break;
1966             start += CLUSTER_PAGES;
1967         }
1968         run_task_queue(&tq_disk);
1969
1970         /* if we're far enough past the beginning of this area,
1971            recycle pages that are in the previous window. */
1972         if (vma->vm_raend >
                        (vma->vm_pgoff + ra_window + ra_window)) {
1973             unsigned long window = ra_window << PAGE_SHIFT;
1974
1975             end = vma->vm_start + (vma->vm_raend << PAGE_SHIFT);
1976             end -= window + window;
1977             filemap_sync(vma, end - window, window, MS_INVALIDATE);
1978         }
1979
1980         vma->vm_raend += ra_window;
1981     }
1982
1983     return;
1984 }
1953 If the fault has occurred half-way through the read-ahead window then schedule the next readahead window to be read in from disk and free the pages for the first half of the current window as they are presumably not required any more
1954-1955 Calculate the start and end of the next readahead window as we are about to schedule it for IO
1957 If the end of the readahead window is after the end of the VMA, then set end to the end of the VMA
1959-1960 If we are at the end of the mapping, just return as there is no more readahead to perform
1962-1967 Schedule the next readahead window to be paged in by calling read_cluster_nonblocking() (See Section D.6.5.2)
1968 Call run_task_queue() to start the IO
1972-1978 Recycle the pages in the previous read-ahead window with filemap_sync() as they are no longer required
1980 Update where the end of the readahead window is
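The window arithmetic is easier to see with concrete numbers. The following standalone sketch assumes CLUSTER_PAGES is 8 (the large-memory x86 case) and a maximum readahead of 31 pages; both values are illustrative, not taken from a particular configuration.

#include <stdio.h>

#define CLUSTER_PAGES     8UL
#define CLUSTER_OFFSET(x) (((x) / CLUSTER_PAGES) * CLUSTER_PAGES)  /* round down */

int main(void)
{
        unsigned long max_readahead = 31;
        unsigned long ra_window = CLUSTER_OFFSET(max_readahead + CLUSTER_PAGES - 1);
        unsigned long vm_pgoff = 0;
        unsigned long vm_raend = vm_pgoff + ra_window;   /* set on the first fault */
        unsigned long pgoff;

        printf("window = %lu pages\n", ra_window);       /* 38 rounds down to 32 */

        /* The next window is only scheduled when the fault lands exactly
         * half a window before vm_raend, mirroring the test at line 1953 */
        for (pgoff = 0; pgoff < ra_window; pgoff++)
                if (pgoff + (ra_window >> 1) == vm_raend)
                        printf("fault at page %lu triggers the next window\n", pgoff);
        return 0;
}

With these values the next window is scheduled when page 16 of a 32-page window is faulted, which is the "half-way through" behaviour described above.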
D.6.5.2 Function: read_cluster_nonblocking() (mm/filemap.c)

737 static int read_cluster_nonblocking(struct file * file,
738                                     unsigned long offset,
                                        unsigned long filesize)
739 {
740     unsigned long pages = CLUSTER_PAGES;
741
742     offset = CLUSTER_OFFSET(offset);
743     while ((pages-- > 0) && (offset < filesize)) {
744         int error = page_cache_read(file, offset);
745         if (error < 0)
746             return error;
747         offset ++;
748     }
749
750     return 0;
751 }
740 CLUSTER_PAGES will be 4 pages in low memory systems and 8 pages in larger ones. This means that on an x86 with ample memory, 32KiB will be read in one cluster
742 CLUSTER_OFFSET() will align the offset to a cluster-sized alignment
743-748 Read the full cluster into the page cache by calling page_cache_read() (See Section D.6.4.2) for each page in the cluster
745-746 If an error occurs during read-ahead, return the error
750 Return success
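The alignment at line 742 simply rounds the requested page down to the start of its cluster, so a whole cluster is always read. The following standalone sketch assumes an 8-page cluster and 4KiB pages to show both the rounding and the 32KiB figure quoted above.

#include <stdio.h>

#define CLUSTER_PAGES 8UL
#define PAGE_SIZE     4096UL

static unsigned long cluster_offset(unsigned long offset)
{
        return (offset / CLUSTER_PAGES) * CLUSTER_PAGES;    /* round down */
}

int main(void)
{
        unsigned long offsets[] = { 0, 5, 8, 13, 21 };
        unsigned int i;

        printf("each cluster reads %lu KiB\n", CLUSTER_PAGES * PAGE_SIZE / 1024);
        for (i = 0; i < 5; i++)
                printf("page %2lu belongs to the cluster starting at page %2lu\n",
                       offsets[i], cluster_offset(offsets[i]));
        return 0;
}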
D.6.6 Swap Related Read-Ahead

D.6.6.1 Function: swapin_readahead() (mm/memory.c)
This function will fault in a number of pages after the current entry. It will stop when either CLUSTER_PAGES have been swapped in or an unused swap entry is found.

1093 void swapin_readahead(swp_entry_t entry)
1094 {
1095     int i, num;
1096     struct page *new_page;
1097     unsigned long offset;
1098
1099     /*
1100      * Get the number of handles we should do readahead io to.
1101      */
1102     num = valid_swaphandles(entry, &offset);
1103     for (i = 0; i < num; offset++, i++) {
1104         /* Ok, do the async read-ahead now */
1105         new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry),
                                                        offset));
1106         if (!new_page)
1107             break;
1108         page_cache_release(new_page);
1109     }
1110     return;
1111 }
1102 valid_swaphandles() is what determines how many pages should be swapped in. It will stop at the first empty entry or when CLUSTER_PAGES is reached
1103-1109 Swap in the pages
1105 Attempt to swap the page into the swap cache with read_swap_cache_async() (See Section K.3.1.1)
1106-1107 If the page could not be paged in, break and return
1108 Drop the reference to the page that read_swap_cache_async() takes
1110 Return
D.6.6.2 Function: valid_swaphandles() (mm/swapfile.c)
This function determines how many pages should be readahead from swap starting from offset. It will readahead to the next unused swap slot but, at most, it will return CLUSTER_PAGES.
1238 int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
1239 {
1240     int ret = 0, i = 1 << page_cluster;
1241     unsigned long toff;
1242     struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
1243
1244     if (!page_cluster)      /* no readahead */
1245         return 0;
1246     toff = (SWP_OFFSET(entry) >> page_cluster) << page_cluster;
1247     if (!toff)              /* first page is swap header */
1248         toff++, i--;
1249     *offset = toff;
1250
1251     swap_device_lock(swapdev);
1252     do {
1253         /* Don't read-ahead past the end of the swap area */
1254         if (toff >= swapdev->max)
1255             break;
1256         /* Don't read in free or bad pages */
1257         if (!swapdev->swap_map[toff])
1258             break;
1259         if (swapdev->swap_map[toff] == SWAP_MAP_BAD)
1260             break;
1261         toff++;
1262         ret++;
1263     } while (--i);
1264     swap_device_unlock(swapdev);
1265     return ret;
1266 }
1240 i is set to CLUSTER_PAGES which is the equivalent of the bitshift shown here
1242 Get the swap_info_struct that contains this entry
1244-1245 If readahead has been disabled, return
1246 Calculate toff to be entry rounded down to the nearest CLUSTER_PAGES-sized boundary
1247-1248 If toff is 0, move it to 1 as the first page contains information about the swap area
1251 Lock the swap device as we are about to scan it
1252-1263 Loop at most i, which is initialised to CLUSTER_PAGES, times
1254-1255 If the end of the swap area is reached, then that is as far as can be readahead
1257-1258 If an unused entry is reached, just return as it is as far as we want to readahead
1259-1260 Likewise, return if a bad entry is discovered
1261 Move to the next slot
1262 Increment the number of pages to be readahead
1264 Unlock the swap device
1265 Return the number of pages which should be readahead
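The rounding at lines 1246-1248 becomes clearer with sample entries. The following standalone sketch assumes page_cluster is 3, so readahead works in groups of 8 slots; the entry offsets are invented.

#include <stdio.h>

int main(void)
{
        unsigned long page_cluster = 3;        /* assume 8-slot clusters */
        unsigned long entries[] = { 1, 7, 8, 21 };
        unsigned int n;

        for (n = 0; n < 4; n++) {
                unsigned long toff = (entries[n] >> page_cluster) << page_cluster;
                unsigned long i = 1UL << page_cluster;

                if (!toff) {                   /* slot 0 holds the swap header */
                        toff++;
                        i--;
                }
                printf("entry %2lu: readahead starts at slot %2lu, at most %lu slots\n",
                       entries[n], toff, i);
        }
        return 0;
}

Entries 1 and 7 both start readahead at slot 1 (losing one slot to the header), while entries 8 and 21 start at the 8-slot boundaries 8 and 16 respectively.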
Appendix E

Boot Memory Allocator

Contents
E.1 Initialising the Boot Memory Allocator . . . 382
    E.1.1 Function: init_bootmem() . . . 382
    E.1.2 Function: init_bootmem_node() . . . 382
    E.1.3 Function: init_bootmem_core() . . . 383
E.2 Allocating Memory . . . 385
    E.2.1 Reserving Large Regions of Memory . . . 385
        E.2.1.1 Function: reserve_bootmem() . . . 385
        E.2.1.2 Function: reserve_bootmem_node() . . . 385
        E.2.1.3 Function: reserve_bootmem_core() . . . 386
    E.2.2 Allocating Memory at Boot Time . . . 387
        E.2.2.1 Function: alloc_bootmem() . . . 387
        E.2.2.2 Function: __alloc_bootmem() . . . 387
        E.2.2.3 Function: alloc_bootmem_node() . . . 388
        E.2.2.4 Function: __alloc_bootmem_node() . . . 389
        E.2.2.5 Function: __alloc_bootmem_core() . . . 389
E.3 Freeing Memory . . . 395
    E.3.1 Function: free_bootmem() . . . 395
    E.3.2 Function: free_bootmem_core() . . . 395
E.4 Retiring the Boot Memory Allocator . . . 397
    E.4.1 Function: mem_init() . . . 397
    E.4.2 Function: free_pages_init() . . . 399
    E.4.3 Function: one_highpage_init() . . . 400
    E.4.4 Function: free_all_bootmem() . . . 401
    E.4.5 Function: free_all_bootmem_core() . . . 401
E.1 Initialising the Boot Memory Allocator

Contents
E.1 Initialising the Boot Memory Allocator . . . 382
    E.1.1 Function: init_bootmem() . . . 382
    E.1.2 Function: init_bootmem_node() . . . 382
    E.1.3 Function: init_bootmem_core() . . . 383

The functions in this section are responsible for bootstrapping the boot memory allocator. It starts with the architecture specific function setup_memory() (See Section B.1.1) but all architectures cover the same basic tasks in the architecture specific function before calling the architecture independent function init_bootmem().

E.1.1 Function: init_bootmem() (mm/bootmem.c)
This is called by UMA architectures to initialise their boot memory allocator structures.
304 unsigned long __init init_bootmem (unsigned long start,
                                       unsigned long pages)
305 {
306     max_low_pfn = pages;
307     min_low_pfn = start;
308     return(init_bootmem_core(&contig_page_data, start, 0, pages));
309 }
304 Confusingly, the pages parameter is actually the end PFN of the memory addressable by this node, not the number of pages as the name implies
306 Set the max PFN addressable by this node in case the architecture dependent code did not
307 Set the min PFN addressable by this node in case the architecture dependent code did not
308 Call init_bootmem_core() (See Section E.1.3) which does the real work of initialising the bootmem_data

E.1.2 Function: init_bootmem_node() (mm/bootmem.c)
This is called by NUMA architectures to initialise boot memory allocator data for a given node.
284 unsigned long __init init_bootmem_node (pg_data_t *pgdat,
                                            unsigned long freepfn,
                                            unsigned long startpfn,
                                            unsigned long endpfn)
285 {
286     return(init_bootmem_core(pgdat, freepfn, startpfn, endpfn));
287 }
286 Just call init_bootmem_core() (See Section E.1.3) directly

E.1.3 Function: init_bootmem_core() (mm/bootmem.c)
Initialises the appropriate struct bootmem_data_t and inserts the node into the linked list of nodes pgdat_list.
46 static unsigned long __init init_bootmem_core (pg_data_t *pgdat,
47      unsigned long mapstart, unsigned long start, unsigned long end)
48 {
49      bootmem_data_t *bdata = pgdat->bdata;
50      unsigned long mapsize = ((end - start)+7)/8;
51
52      pgdat->node_next = pgdat_list;
53      pgdat_list = pgdat;
54
55      mapsize = (mapsize + (sizeof(long) - 1UL)) &
                  ~(sizeof(long) - 1UL);
56      bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);
57      bdata->node_boot_start = (start << PAGE_SHIFT);
58      bdata->node_low_pfn = end;
59
60      /*
61       * Initially all pages are reserved - setup_arch() has to
62       * register free RAM areas explicitly.
63       */
64      memset(bdata->node_bootmem_map, 0xff, mapsize);
65
66      return mapsize;
67 }
46 The parameters are;
    pgdat is the node descriptor being initialised
    mapstart is the beginning of the memory that will be usable
    start is the beginning PFN of the node
    end is the end PFN of the node
50 Each page requires one bit to represent it so the size of the map required is the number of pages in this node rounded up to the nearest multiple of 8 and then divided by 8 to give the number of bytes required
52-53 As the node will shortly be considered initialised, insert it into the global pgdat_list
55 Round the mapsize up to the closest word boundary
56 Convert the mapstart to a virtual address and store it in bdata→node_bootmem_map
57 Convert the starting PFN to a physical address and store it in node_boot_start
58 Store the end PFN of ZONE_NORMAL in node_low_pfn
64 Fill the full map with 1s, marking all pages as allocated. It is up to the architecture dependent code to mark the usable pages
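The bitmap sizing at lines 50 and 55 can be checked with a standalone sketch. The node below is an invented 512MiB node with 4KiB pages; the calculation is the same as in the function above.

#include <stdio.h>

int main(void)
{
        unsigned long start = 0;        /* first PFN of the node */
        unsigned long end   = 131072;   /* last PFN: 512MiB of 4KiB pages */
        unsigned long mapsize = ((end - start) + 7) / 8;     /* one bit per page */

        /* round up to a word boundary as line 55 does */
        mapsize = (mapsize + (sizeof(long) - 1UL)) & ~(sizeof(long) - 1UL);
        printf("bitmap needs %lu bytes for %lu pages\n", mapsize, end - start);
        return 0;
}

For this node the map is 16384 bytes, which is already word aligned, so the rounding at line 55 changes nothing here; it only matters for nodes whose page count is not a multiple of 8 * sizeof(long).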
E.2 Allocating Memory

Contents
E.2 Allocating Memory . . . 385
    E.2.1 Reserving Large Regions of Memory . . . 385
        E.2.1.1 Function: reserve_bootmem() . . . 385
        E.2.1.2 Function: reserve_bootmem_node() . . . 385
        E.2.1.3 Function: reserve_bootmem_core() . . . 386
    E.2.2 Allocating Memory at Boot Time . . . 387
        E.2.2.1 Function: alloc_bootmem() . . . 387
        E.2.2.2 Function: __alloc_bootmem() . . . 387
        E.2.2.3 Function: alloc_bootmem_node() . . . 388
        E.2.2.4 Function: __alloc_bootmem_node() . . . 389
        E.2.2.5 Function: __alloc_bootmem_core() . . . 389
E.2.1 Reserving Large Regions of Memory

E.2.1.1 Function: reserve_bootmem() (mm/bootmem.c)

311 void __init reserve_bootmem (unsigned long addr, unsigned long size)
312 {
313     reserve_bootmem_core(contig_page_data.bdata, addr, size);
314 }

313 Just call reserve_bootmem_core() (See Section E.2.1.3). As this is for a non-NUMA architecture, the node to allocate from is the static contig_page_data node.

E.2.1.2 Function: reserve_bootmem_node() (mm/bootmem.c)

289 void __init reserve_bootmem_node (pg_data_t *pgdat,
                                      unsigned long physaddr,
                                      unsigned long size)
290 {
291     reserve_bootmem_core(pgdat->bdata, physaddr, size);
292 }

291 Just call reserve_bootmem_core() (See Section E.2.1.3) passing it the bootmem data of the requested node
E.2.1.3 Function: reserve_bootmem_core() (mm/bootmem.c)

74 static void __init reserve_bootmem_core(bootmem_data_t *bdata,
                                           unsigned long addr,
                                           unsigned long size)
75 {
76      unsigned long i;
77      /*
78       * round up, partially reserved pages are considered
79       * fully reserved.
80       */
81      unsigned long sidx = (addr - bdata->node_boot_start)/PAGE_SIZE;
82      unsigned long eidx = (addr + size - bdata->node_boot_start +
83                            PAGE_SIZE-1)/PAGE_SIZE;
84      unsigned long end = (addr + size + PAGE_SIZE-1)/PAGE_SIZE;
85
86      if (!size) BUG();
87
88      if (sidx < 0)
89          BUG();
90      if (eidx < 0)
91          BUG();
92      if (sidx >= eidx)
93          BUG();
94      if ((addr >> PAGE_SHIFT) >= bdata->node_low_pfn)
95          BUG();
96      if (end > bdata->node_low_pfn)
97          BUG();
98      for (i = sidx; i < eidx; i++)
99          if (test_and_set_bit(i, bdata->node_bootmem_map))
100             printk("hm, page %08lx reserved twice.\n",
                       i*PAGE_SIZE);
101 }
81 The sidx is the starting index to serve pages from. The value is obtained by subtracting the starting address from the requested address and dividing by the size of a page
82 A similar calculation is made for the ending index eidx except that the allocation is rounded up to the nearest page. This means that requests to partially reserve a page will result in the full page being reserved
84 end is the last PFN that is affected by this reservation
86 Check that a non-zero value has been given
88-89 Check the starting index is not before the start of the node
90-91 Check the end index is not before the start of the node
92-93 Check the starting index is not after the end index
94-95 Check the starting address is not beyond the memory this bootmem node represents
96-97 Check the ending address is not beyond the memory this bootmem node represents
98-100 Starting with sidx and finishing with eidx, test and set the bit in the bootmem map that represents the page marking it as allocated. If the bit was already set to 1, print out a message saying it was reserved twice
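As an illustration of the index arithmetic, the following standalone sketch reserves an invented 6KiB region that ends part way through a page. The rounding of eidx means both pages touched by the region are marked reserved.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        unsigned long node_boot_start = 0;     /* node starts at physical 0 */
        unsigned long addr = 0x100000UL;       /* reserve starting at 1MiB... */
        unsigned long size = 0x1800UL;         /* ...for 6KiB */

        unsigned long sidx = (addr - node_boot_start) / PAGE_SIZE;
        unsigned long eidx = (addr + size - node_boot_start + PAGE_SIZE - 1) / PAGE_SIZE;
        unsigned long end  = (addr + size + PAGE_SIZE - 1) / PAGE_SIZE;

        /* pages 256 and 257 are both fully reserved even though only
         * half of page 257 is covered by the request */
        printf("sidx = %lu, eidx = %lu, end PFN = %lu\n", sidx, eidx, end);
        return 0;
}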
E.2.2 Allocating Memory at Boot Time

E.2.2.1 Function: alloc_bootmem() (mm/bootmem.c)
The callgraph for these macros is shown in Figure 5.1.

38 #define alloc_bootmem(x) \
39      __alloc_bootmem((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
40 #define alloc_bootmem_low(x) \
41      __alloc_bootmem((x), SMP_CACHE_BYTES, 0)
42 #define alloc_bootmem_pages(x) \
43      __alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
44 #define alloc_bootmem_low_pages(x) \
45      __alloc_bootmem((x), PAGE_SIZE, 0)

39 alloc_bootmem() will align to the L1 hardware cache and start searching for a page after the maximum address usable for DMA
40 alloc_bootmem_low() will align to the L1 hardware cache and start searching from page 0
42 alloc_bootmem_pages() will align the allocation to a page size so that full pages will be allocated starting from the maximum address usable for DMA
44 alloc_bootmem_low_pages() will align the allocation to a page size so that full pages will be allocated starting from physical address 0
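As a hypothetical illustration only, the fragment below shows the shape of a typical caller of these macros during boot. The names example_early_setup, early_table and dma_scratch and their sizes are invented for this sketch and do not appear in the kernel; only the two macros are real.

/* Hypothetical boot-time allocations, for illustration only */
static struct page **early_table;
static void *dma_scratch;

void __init example_early_setup(void)
{
        /* L1-cache aligned, searched for above the DMA limit */
        early_table = alloc_bootmem(2048 * sizeof(struct page *));

        /* page aligned and allowed to come from ZONE_DMA */
        dma_scratch = alloc_bootmem_low_pages(PAGE_SIZE);
}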
E.2.2.2 Function: __alloc_bootmem() (mm/bootmem.c)

326 void * __init __alloc_bootmem (unsigned long size,
                                   unsigned long align, unsigned long goal)
327 {
328     pg_data_t *pgdat;
329     void *ptr;
330
331     for_each_pgdat(pgdat)
332         if ((ptr = __alloc_bootmem_core(pgdat->bdata, size,
333                                         align, goal)))
334             return(ptr);
335
336     /*
337      * Whoops, we cannot satisfy the allocation request.
338      */
339     printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
340     panic("Out of memory");
341     return NULL;
342 }
326 The parameters are;
    size is the size of the requested allocation
    align is the desired alignment and must be a power of 2. Currently either SMP_CACHE_BYTES or PAGE_SIZE
    goal is the starting address to begin searching from
331-334 Cycle through all available nodes and try allocating from each in turn. In the UMA case, this will just allocate from the contig_page_data node
339-340 If the allocation fails, the system is not going to be able to boot so the kernel panics
E.2.2.3 Function: alloc_bootmem_node() (mm/bootmem.c)

53 #define alloc_bootmem_node(pgdat, x) \
54      __alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES,
                             __pa(MAX_DMA_ADDRESS))
55 #define alloc_bootmem_pages_node(pgdat, x) \
56      __alloc_bootmem_node((pgdat), (x), PAGE_SIZE,
                             __pa(MAX_DMA_ADDRESS))
57 #define alloc_bootmem_low_pages_node(pgdat, x) \
58      __alloc_bootmem_node((pgdat), (x), PAGE_SIZE, 0)

53-54 alloc_bootmem_node() will allocate from the requested node and align to the L1 hardware cache and start searching for a page beginning with ZONE_NORMAL (i.e. at the end of ZONE_DMA which is at MAX_DMA_ADDRESS)
55-56 alloc_bootmem_pages_node() will allocate from the requested node and align the allocation to a page size so that full pages will be allocated starting from ZONE_NORMAL
57-58 alloc_bootmem_low_pages_node() will allocate from the requested node and align the allocation to a page size so that full pages will be allocated starting from physical address 0 so that ZONE_DMA will be used
E.2.2.4 Function: __alloc_bootmem_node() (mm/bootmem.c)

344 void * __init __alloc_bootmem_node (pg_data_t *pgdat,
                                        unsigned long size,
                                        unsigned long align,
                                        unsigned long goal)
345 {
346     void *ptr;
347
348     ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal);
349     if (ptr)
350         return (ptr);
351
352     /*
353      * Whoops, we cannot satisfy the allocation request.
354      */
355     printk(KERN_ALERT "bootmem alloc of %lu bytes failed!\n", size);
356     panic("Out of memory");
357     return NULL;
358 }

344 The parameters are the same as for __alloc_bootmem() (See Section E.2.2.2) except the node to allocate from is specified
348 Call the core function __alloc_bootmem_core() (See Section E.2.2.5) to perform the allocation
349-350 Return a pointer if it was successful
355-356 Otherwise print out a message and panic the kernel as the system will not boot if memory cannot be allocated even now
E.2.2.5 Function: __alloc_bootmem_core() (mm/bootmem.c)
This is the core function for allocating memory from a specified node with the boot memory allocator. It is quite large and broken up into the following tasks;

• Function preamble. Make sure the parameters are sane
• Calculate the starting address to scan from based on the goal parameter
• Check to see if this allocation may be merged with the page used for the previous allocation to save memory
• Mark the pages allocated as 1 in the bitmap and zero out the contents of the pages

144 static void * __init __alloc_bootmem_core (bootmem_data_t *bdata,
145     unsigned long size, unsigned long align, unsigned long goal)
146 {
147     unsigned long i, start = 0;
148     void *ret;
149     unsigned long offset, remaining_size;
150     unsigned long areasize, preferred, incr;
151     unsigned long eidx = bdata->node_low_pfn -
152                          (bdata->node_boot_start >> PAGE_SHIFT);
153
154     if (!size) BUG();
155
156     if (align & (align-1))
157         BUG();
158
159     offset = 0;
160     if (align &&
161         (bdata->node_boot_start & (align - 1UL)) != 0)
162         offset = (align - (bdata->node_boot_start &
                               (align - 1UL)));
163     offset >>= PAGE_SHIFT;
Function preamble, make sure the parameters are sane

144 The parameters are;
    bdata is the bootmem struct for the node being allocated from
    size is the size of the requested allocation
    align is the desired alignment for the allocation. Must be a power of 2
    goal is the preferred address to allocate above if possible
151 Calculate the ending bit index eidx which returns the highest page index that may be used for the allocation
154 Call BUG() if a request size of 0 is specified
156-157 If the alignment is not a power of 2, call BUG()
159 The default offset for alignments is 0
160 If an alignment has been specified and...
161 the start of the node is not already aligned to the requested alignment, then calculate the offset to use
162 The offset to use is the requested alignment minus the lower bits of the starting address masked against (align - 1). In reality, this offset will likely be identical to align for the prevalent values of align
169     if (goal && (goal >= bdata->node_boot_start) &&
170         ((goal >> PAGE_SHIFT) < bdata->node_low_pfn)) {
171         preferred = goal - bdata->node_boot_start;
172     } else
173         preferred = 0;
174
175     preferred = ((preferred + align - 1) & ~(align - 1))
                    >> PAGE_SHIFT;
176     preferred += offset;
177     areasize = (size+PAGE_SIZE-1)/PAGE_SIZE;
178     incr = align >> PAGE_SHIFT ? : 1;

Calculate the starting PFN to start scanning from based on the goal parameter.

169 If a goal has been specified and the goal is after the starting address for this node and the PFN of the goal is less than the last PFN addressable by this node then ....
170 The preferred offset to start from is the goal minus the beginning of the memory addressable by this node
173 Else the preferred offset is 0
175-176 Adjust the preferred address to take the offset into account so that the address will be correctly aligned
177 The number of pages that will be affected by this allocation is stored in areasize
178 incr is the number of pages that have to be skipped to satisfy alignment requirements if they are over one page
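The alignment preamble (lines 159-163) is easiest to follow with numbers. The following standalone sketch uses an invented node start address that is 4KiB past a 16KiB boundary, with a 16KiB alignment request.

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
        unsigned long node_boot_start = 0x00101000UL;   /* 4KiB past 1MiB */
        unsigned long align = 0x4000UL;                 /* 16KiB alignment */
        unsigned long offset = 0;

        if (align && (node_boot_start & (align - 1UL)) != 0)
                offset = align - (node_boot_start & (align - 1UL));

        /* the scan must start 3 pages into the node to get 16KiB alignment */
        printf("offset = %lu bytes = %lu pages\n", offset, offset >> PAGE_SHIFT);
        return 0;
}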
179
180 restart_scan:
181     for (i = preferred; i < eidx; i += incr) {
182         unsigned long j;
183         if (test_bit(i, bdata->node_bootmem_map))
184             continue;
185         for (j = i + 1; j < i + areasize; ++j) {
186             if (j >= eidx)
187                 goto fail_block;
188             if (test_bit (j, bdata->node_bootmem_map))
189                 goto fail_block;
190         }
191         start = i;
192         goto found;
193     fail_block:;
194     }
195     if (preferred) {
196         preferred = offset;
197         goto restart_scan;
198     }
199     return NULL;
Scan through memory looking for a block large enough to satisfy this request

180 If the allocation could not be satisfied starting from goal, this label is jumped to so that the map will be rescanned
181-194 Starting from preferred, scan linearly searching for a free block large enough to satisfy the request. Walk the address space in incr steps to satisfy alignments greater than one page. If the alignment is less than a page, incr will just be 1
183-184 Test the bit, if it is already 1, it is not free so move to the next page
185-190 Scan the next areasize number of pages and see if they are also free. It fails if the end of the addressable space is reached (eidx) or one of the pages is already in use
191-192 A free block is found so record the start and jump to the found block
195-198 The allocation failed so start again from the beginning
199 If that also failed, return NULL which will result in a kernel panic
200 found:
201     if (start >= eidx)
202         BUG();
203
209     if (align <= PAGE_SIZE
210         && bdata->last_offset && bdata->last_pos+1 == start) {
211         offset = (bdata->last_offset+align-1) & ~(align-1);
212         if (offset > PAGE_SIZE)
213             BUG();
214         remaining_size = PAGE_SIZE-offset;
215         if (size < remaining_size) {
216             areasize = 0;
217             // last_pos unchanged
218             bdata->last_offset = offset+size;
219             ret = phys_to_virt(bdata->last_pos*PAGE_SIZE + offset +
220                                bdata->node_boot_start);
221         } else {
222             remaining_size = size - remaining_size;
223             areasize = (remaining_size+PAGE_SIZE-1)/PAGE_SIZE;
224             ret = phys_to_virt(bdata->last_pos*PAGE_SIZE +
225                                offset +
                                   bdata->node_boot_start);
226             bdata->last_pos = start+areasize-1;
227             bdata->last_offset = remaining_size;
228         }
229         bdata->last_offset &= ~PAGE_MASK;
230     } else {
231         bdata->last_pos = start + areasize - 1;
232         bdata->last_offset = size & ~PAGE_MASK;
233         ret = phys_to_virt(start * PAGE_SIZE +
234                            bdata->node_boot_start);
        }
Test to see if this allocation may be merged with the previous allocation.

201-202 Check that the start of the allocation is not after the addressable memory. This check was just made so it is redundant
209-230 Try and merge with the previous allocation if the alignment is less than a PAGE_SIZE, the previous page has space in it (last_offset != 0) and the previously used page is adjacent to the page found for this allocation
231-234 Else record the pages and offset used for this allocation to be used for merging with the next allocation
211 Update the offset to use to be aligned correctly for the requested align
212-213 If the offset now goes over the edge of a page, BUG() is called. This condition would require a very poor choice of alignment to be used. As the only alignment commonly used is a factor of PAGE_SIZE, it is impossible for normal usage
214 remaining_size is the remaining free space in the previously used page
215-221 If there is enough space left in the old page then use the old page totally and update the bootmem_data struct to reflect it
221-228 Else calculate how many pages in addition to this one will be required and update the bootmem_data struct
216 The number of pages used by this allocation is now 0
218 Update the last_offset to be the end of this allocation
219 Calculate the virtual address to return for the successful allocation
222 remaining_size is how much space will be used in the last page used to satisfy the allocation
223 Calculate how many more pages are needed to satisfy the allocation
224 Record the address the allocation starts from
226 The last page used is the start page plus the number of additional pages required to satisfy this allocation (areasize)
227 The end of the allocation has already been calculated
229 If the offset is at the end of the page, make it 0
231 No merging took place so record the last page used to satisfy this allocation
232 Record how much of the last page was used
233 Record the starting virtual address of the allocation
238     for (i = start; i < start+areasize; i++)
239         if (test_and_set_bit(i, bdata->node_bootmem_map))
240             BUG();
241     memset(ret, 0, size);
242     return ret;
243 }

Mark the pages allocated as 1 in the bitmap and zero out the contents of the pages

238-240 Cycle through all pages used for this allocation and set the bit to 1 in the bitmap. If any of them are already 1, then a double allocation took place so call BUG()
241 Zero fill the pages
242 Return the address of the allocation
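The merge decision at lines 209-230 can be reproduced in isolation. The following standalone sketch uses invented bookkeeping values (last_pos, last_offset) and a small request to show when the new allocation shares the previous allocation's last page.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        unsigned long last_pos = 10;      /* last page used by the previous allocation */
        unsigned long last_offset = 100;  /* bytes of that page already in use */
        unsigned long align = 32, size = 600;

        unsigned long offset = (last_offset + align - 1) & ~(align - 1);
        unsigned long remaining = PAGE_SIZE - offset;

        if (size < remaining)
                printf("merged: new allocation at page %lu, offset %lu\n",
                       last_pos, offset);
        else
                printf("needs %lu extra page(s)\n",
                       (size - remaining + PAGE_SIZE - 1) / PAGE_SIZE);
        return 0;
}

Here a 600-byte, 32-byte aligned request fits at offset 128 of page 10, so no new page is consumed at all, which is the whole point of the merging logic.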
E.3 Freeing Memory

Contents
E.3 Freeing Memory . . . 395
    E.3.1 Function: free_bootmem() . . . 395
    E.3.2 Function: free_bootmem_core() . . . 395

E.3.1 Function: free_bootmem() (mm/bootmem.c)

Figure E.1: Call Graph: free_bootmem()

294 void __init free_bootmem_node (pg_data_t *pgdat,
                                   unsigned long physaddr, unsigned long size)
295 {
296     return(free_bootmem_core(pgdat->bdata, physaddr, size));
297 }

316 void __init free_bootmem (unsigned long addr, unsigned long size)
317 {
318     return(free_bootmem_core(contig_page_data.bdata, addr, size));
319 }

296 Call the core function with the corresponding bootmem data for the requested node
318 Call the core function with the bootmem data for contig_page_data
E.3.2 Function: free_bootmem_core() (mm/bootmem.c)

103 static void __init free_bootmem_core(bootmem_data_t *bdata,
                                         unsigned long addr,
                                         unsigned long size)
104 {
105     unsigned long i;
106     unsigned long start;
111     unsigned long sidx;
112     unsigned long eidx = (addr + size -
                              bdata->node_boot_start)/PAGE_SIZE;
113     unsigned long end = (addr + size)/PAGE_SIZE;
114
115     if (!size) BUG();
116     if (end > bdata->node_low_pfn)
117         BUG();
118
119     /*
120      * Round up the beginning of the address.
121      */
122     start = (addr + PAGE_SIZE-1) / PAGE_SIZE;
123     sidx = start - (bdata->node_boot_start/PAGE_SIZE);
124
125     for (i = sidx; i < eidx; i++) {
126         if (!test_and_clear_bit(i, bdata->node_bootmem_map))
127             BUG();
128     }
129 }
112 Calculate the end index affected as eidx
113 The end address is the end of the affected area rounded down to the nearest page if it is not already page aligned
115 If a size of 0 is freed, call BUG
116-117 If the end PFN is after the memory addressable by this node, call BUG
122 Round the starting address up to the nearest page if it is not already page aligned
123 Calculate the starting index to free
125-127 For all full pages that are freed by this action, clear the bit in the boot bitmap. If it is already 0, it is a double free or is memory that was never used so call BUG
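The opposite rounding to reserve_bootmem_core() is used here: the start of the freed region rounds up and the end rounds down, so partially covered pages remain reserved. The standalone sketch below frees an invented region to show which pages are actually returned.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        unsigned long node_boot_start = 0;
        unsigned long addr = 0x2100UL;         /* free 0x2100 to 0x5100 */
        unsigned long size = 0x3000UL;

        unsigned long sidx = ((addr + PAGE_SIZE - 1) / PAGE_SIZE)
                             - (node_boot_start / PAGE_SIZE);
        unsigned long eidx = (addr + size - node_boot_start) / PAGE_SIZE;

        /* only the fully covered pages are freed */
        printf("pages %lu to %lu are freed\n", sidx, eidx - 1);
        return 0;
}

For this region only pages 3 and 4 are freed; the partially covered pages 2 and 5 stay marked as allocated.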
E.4 Retiring the Boot Memory Allocator

Contents
E.4 Retiring the Boot Memory Allocator . . . 397
    E.4.1 Function: mem_init() . . . 397
    E.4.2 Function: free_pages_init() . . . 399
    E.4.3 Function: one_highpage_init() . . . 400
    E.4.4 Function: free_all_bootmem() . . . 401
    E.4.5 Function: free_all_bootmem_core() . . . 401

Once the system is started, the boot memory allocator is no longer needed so these functions are responsible for removing unnecessary boot memory allocator structures and passing the remaining pages to the normal physical page allocator.
E.4.1 Function: mem_init() (arch/i386/mm/init.c)
The call graph for this function is shown in Figure 5.2. The important part of this function for the boot memory allocator is that it calls free_pages_init() (See Section E.4.2). The function is broken up into the following tasks

• Function preamble, set the PFN within the global mem_map for the location of high memory and zero out the system wide zero page
• Call free_pages_init() (See Section E.4.2)
• Print out an informational message on the availability of memory in the system
• Check the CPU supports PAE if the config option is enabled and test the WP bit on the CPU. This is important as without the WP bit, the function verify_write() has to be called for every write to userspace from the kernel. This only applies to old processors like the 386
• Fill in entries for the userspace portion of the PGD for swapper_pg_dir, the kernel page tables. The zero page is mapped for all entries

507 void __init mem_init(void)
508 {
509     int codesize, reservedpages, datasize, initsize;
510
511     if (!mem_map)
512         BUG();
513
514     set_max_mapnr_init();
515
516     high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
517
518     /* clear the zero-page */
519     memset(empty_zero_page, 0, PAGE_SIZE);
514 This function records the PFN that high memory starts at in mem_map (highmem_start_page), the maximum number of pages in the system (max_mapnr and num_physpages) and finally the maximum number of pages that may be mapped by the kernel (num_mappedpages)
516 high_memory is the virtual address where high memory begins
519 Zero out the system wide zero page
520
521     reservedpages = free_pages_init();
522

521 Call free_pages_init() (See Section E.4.2) which tells the boot memory allocator to retire itself as well as initialising all pages in high memory for use with the buddy allocator
523     codesize =  (unsigned long) &_etext - (unsigned long) &_text;
524     datasize =  (unsigned long) &_edata - (unsigned long) &_etext;
525     initsize =  (unsigned long) &__init_end - (unsigned long)
                     &__init_begin;
526
527     printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code,
                %dk reserved, %dk data, %dk init, %ldk highmem)\n",
528         (unsigned long) nr_free_pages() << (PAGE_SHIFT-10),
529         max_mapnr << (PAGE_SHIFT-10),
530         codesize >> 10,
531         reservedpages << (PAGE_SHIFT-10),
532         datasize >> 10,
533         initsize >> 10,
534         (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10))
535         );

Print out an informational message

523-525 Calculate the size of the code segment, data segment and memory used by initialisation code and data (all functions marked __init will be in this section)
527-535 Print out a nice message on the availability of memory and the amount of memory consumed by the kernel
536
537 #if CONFIG_X86_PAE
538     if (!cpu_has_pae)
539         panic("cannot execute a PAE-enabled kernel on a PAE-less
                   CPU!");
540 #endif
541     if (boot_cpu_data.wp_works_ok < 0)
542         test_wp_bit();
543

538-539 If PAE is enabled but the processor does not support it, panic
541-542 Test for the availability of the WP bit

550 #ifndef CONFIG_SMP
551     zap_low_mappings();
552 #endif
553
554 }

551 Cycle through each PGD used by the userspace portion of swapper_pg_dir and map the zero page to it
E.4.2 Function: free_pages_init() (arch/i386/mm/init.c)
This function has two important jobs: to call free_all_bootmem() (See Section E.4.4) to retire the boot memory allocator and to free all high memory pages to the buddy allocator.

481 static int __init free_pages_init(void)
482 {
483     extern int ppro_with_ram_bug(void);
484     int bad_ppro, reservedpages, pfn;
485
486     bad_ppro = ppro_with_ram_bug();
487
488     /* this will put all low memory onto the freelists */
489     totalram_pages += free_all_bootmem();
490
491     reservedpages = 0;
492     for (pfn = 0; pfn < max_low_pfn; pfn++) {
493         /*
494          * Only count reserved RAM pages
495          */
496         if (page_is_ram(pfn) && PageReserved(mem_map+pfn))
497             reservedpages++;
498     }
499 #ifdef CONFIG_HIGHMEM
500     for (pfn = highend_pfn-1; pfn >= highstart_pfn; pfn--)
501         one_highpage_init((struct page *) (mem_map + pfn), pfn,
                              bad_ppro);
502     totalram_pages += totalhigh_pages;
503 #endif
504     return reservedpages;
505 }
486 There is a bug in the Pentium Pro that prevents certain pages in high memory being used. The function ppro_with_ram_bug() checks for its existence
489 Call free_all_bootmem() to retire the boot memory allocator
491-498 Cycle through all of memory and count the number of reserved pages that were left over by the boot memory allocator
500-501 For each page in high memory, call one_highpage_init() (See Section E.4.3). This function clears the PG_reserved bit, sets the PG_highmem bit, sets the count to 1, calls __free_pages() to give the page to the buddy allocator and increments the totalhigh_pages count. Pages which kill buggy Pentium Pros are skipped
E.4.3 Function: one_highpage_init() (arch/i386/mm/init.c)
This function initialises the information for one page in high memory and checks to make sure that the page will not trigger a bug with some Pentium Pros. It only exists if CONFIG_HIGHMEM is specified at compile time.

449 #ifdef CONFIG_HIGHMEM
450 void __init one_highpage_init(struct page *page, int pfn,
                                 int bad_ppro)
451 {
452     if (!page_is_ram(pfn)) {
453         SetPageReserved(page);
454         return;
455     }
456
457     if (bad_ppro && page_kills_ppro(pfn)) {
458         SetPageReserved(page);
459         return;
460     }
461
462     ClearPageReserved(page);
463     set_bit(PG_highmem, &page->flags);
464     atomic_set(&page->count, 1);
465     __free_page(page);
466     totalhigh_pages++;
467 }
468 #endif /* CONFIG_HIGHMEM */
452-455 If a page does not exist at the PFN, then mark the struct page as reserved so it will not be used
457-460 If the running CPU is susceptible to the Pentium Pro bug and this page is a page that would cause a crash (page_kills_ppro() performs the check), then mark the page as reserved so it will never be allocated
462 From here on, the page is a high memory page that should be used so first clear the reserved bit so it will be given to the buddy allocator later
463 Set the PG_highmem bit to show it is a high memory page
464 Initialise the usage count of the page to 1 which will be set to 0 by the buddy allocator
465 Free the page with __free_page() (See Section F.4.2) so that the buddy allocator will add the high memory page to its free lists
466 Increment the total number of available high memory pages (totalhigh_pages)
E.4.4 Function: free_all_bootmem() (mm/bootmem.c)

299 unsigned long __init free_all_bootmem_node (pg_data_t *pgdat)
300 {
301     return(free_all_bootmem_core(pgdat));
302 }

321 unsigned long __init free_all_bootmem (void)
322 {
323     return(free_all_bootmem_core(&contig_page_data));
324 }

299-302 For NUMA, simply call the core function with the specified pgdat
321-324 For UMA, call the core function with the only node contig_page_data
E.4.5 Function: free_all_bootmem_core() (mm/bootmem.c)
This is the core function which retires the boot memory allocator. It is divided into two major tasks

• For all unallocated pages known to the allocator for this node;
    Clear the PG_reserved flag in its struct page
    Set the count to 1
    Call __free_pages() so that the buddy allocator can build its free lists
• Free all pages used for the bitmap and free them to the buddy allocator
245 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
246 {
247     struct page *page = pgdat->node_mem_map;
248     bootmem_data_t *bdata = pgdat->bdata;
249     unsigned long i, count, total = 0;
250     unsigned long idx;
251
252     if (!bdata->node_bootmem_map) BUG();
253
254     count = 0;
255     idx = bdata->node_low_pfn -
              (bdata->node_boot_start >> PAGE_SHIFT);
256     for (i = 0; i < idx; i++, page++) {
257         if (!test_bit(i, bdata->node_bootmem_map)) {
258             count++;
259             ClearPageReserved(page);
260             set_page_count(page, 1);
261             __free_page(page);
262         }
263     }
264     total += count;
252 If no map is available, it means that this node has already been freed and something woeful is wrong with the architecture dependent code so call BUG()
254 A running count of the number of pages given to the buddy allocator
255 idx is the last index that is addressable by this node
256-263 Cycle through all pages addressable by this node
257 If the page is marked free then...
258 Increase the running count of pages given to the buddy allocator
259 Clear the PG_reserved flag
260 Set the count to 1 so that the buddy allocator will think this is the last user of the page and place it in its free lists
261 Call the buddy allocator free function so the page will be added to its free lists
264 total will become the total number of pages given over by this function
270     page = virt_to_page(bdata->node_bootmem_map);
271     count = 0;
272     for (i = 0;
             i < ((bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT)
                  )/8 + PAGE_SIZE-1)/PAGE_SIZE;
             i++, page++) {
273         count++;
274         ClearPageReserved(page);
275         set_page_count(page, 1);
276         __free_page(page);
277     }
278     total += count;
279     bdata->node_bootmem_map = NULL;
280
281     return total;
282 }
Free the allocator bitmap and return

270 Get the struct page that is at the beginning of the bootmem map
271 Count of pages freed by the bitmap
272-277 For all pages used by the bitmap, free them to the buddy allocator in the same way the previous block of code did
279 Set the bootmem map to NULL to prevent it being freed a second time by accident
281 Return the total number of pages freed by this function, or in other words, return the number of pages that were added to the buddy allocator's free lists
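The loop bound at line 272 works out how many pages the bitmap itself occupies. The standalone sketch below evaluates the same expression for an invented 512MiB node with 4KiB pages.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        unsigned long node_pages = 131072;   /* 512MiB of 4KiB pages */
        unsigned long map_pages = (node_pages / 8 + PAGE_SIZE - 1) / PAGE_SIZE;

        /* one bit per page -> 16384 bytes -> 4 pages of bitmap to free */
        printf("the bitmap for %lu pages occupies %lu page(s)\n",
               node_pages, map_pages);
        return 0;
}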
Appendix F

Physical Page Allocation

Contents
F.1 Allocating Pages . . . 405
    F.1.1 Function: alloc_pages() . . . 405
    F.1.2 Function: _alloc_pages() . . . 405
    F.1.3 Function: __alloc_pages() . . . 406
    F.1.4 Function: rmqueue() . . . 410
    F.1.5 Function: expand() . . . 412
    F.1.6 Function: balance_classzone() . . . 414
F.2 Allocation Helper Functions . . . 418
    F.2.1 Function: alloc_page() . . . 418
    F.2.2 Function: __get_free_page() . . . 418
    F.2.3 Function: __get_free_pages() . . . 418
    F.2.4 Function: __get_dma_pages() . . . 419
    F.2.5 Function: get_zeroed_page() . . . 419
F.3 Free Pages . . . 420
    F.3.1 Function: __free_pages() . . . 420
    F.3.2 Function: __free_pages_ok() . . . 420
F.4 Free Helper Functions . . . 425
    F.4.1 Function: free_pages() . . . 425
    F.4.2 Function: __free_page() . . . 425
    F.4.3 Function: free_page() . . . 425
F.1 Allocating Pages

Contents
F.1 Allocating Pages . . . 405
    F.1.1 Function: alloc_pages() . . . 405
    F.1.2 Function: _alloc_pages() . . . 405
    F.1.3 Function: __alloc_pages() . . . 406
    F.1.4 Function: rmqueue() . . . 410
    F.1.5 Function: expand() . . . 412
    F.1.6 Function: balance_classzone() . . . 414

F.1.1 Function: alloc_pages() (include/linux/mm.h)
The call graph for this function is shown in Figure 6.3. It is declared as follows;
439 static inline struct page * alloc_pages(unsigned int gfp_mask,
                                            unsigned int order)
440 {
444     if (order >= MAX_ORDER)
445         return NULL;
446     return _alloc_pages(gfp_mask, order);
447 }
439 The gfp_mask (Get Free Pages) flags tell the allocator how it may behave. For example, if GFP_WAIT is not set, the allocator will not block and instead return NULL if memory is tight. The order is the power of two number of pages to allocate
444-445 A simple debugging check optimized away at compile time
446 This function is described next
F.1.2 Function: _alloc_pages() (mm/page_alloc.c)
The function _alloc_pages() comes in two varieties. The first is designed to work only with UMA architectures such as the x86 and is in mm/page_alloc.c. It only refers to the static node contig_page_data. The second is in mm/numa.c and is a simple extension. It uses a node-local allocation policy which means that memory will be allocated from the bank closest to the processor. For the purposes of this book, only the mm/page_alloc.c version will be examined but developers on NUMA architectures should read _alloc_pages() and _alloc_pages_pgdat() as well in mm/numa.c

244 #ifndef CONFIG_DISCONTIGMEM
245 struct page *_alloc_pages(unsigned int gfp_mask,
                              unsigned int order)
246 {
247     return __alloc_pages(gfp_mask, order,
248         contig_page_data.node_zonelists+(gfp_mask & GFP_ZONEMASK));
249 }
250 #endif
244
The ifndef is for UMA architectures like the x86. NUMA architectures used
the
_alloc_pages() function in mm/numa.c which employs a node local policy
for allocations
245
The
gfp_mask
ags tell the allocator how it may behave. The
order
is the
power of two number of pages to allocate
247 node_zonelists
is initialised in
is an array of preferred fallback zones to allocate from. It
build_zonelists()(See
gfp_mask indicate
bitmask gfp_mask
Section B.1.6) The lower 16 bits of
what zone is preferable to allocate from.
&
GFP_ZONEMASK
will give the index in
Applying the
node_zonelists
we prefer to allocate from.
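The index calculation can be illustrated with a small user-space sketch. The value
of GFP_ZONEMASK is assumed here to be 0x0f as in the 2.4 headers; treat it as
illustrative rather than authoritative.

    #include <stdio.h>

    #define GFP_ZONEMASK 0x0f       /* assumed value, mirrors 2.4 include/linux/mm.h */

    int main(void)
    {
            unsigned int gfp_mask = 0x01;   /* e.g. a mask requesting ZONE_DMA */
            unsigned int index = gfp_mask & GFP_ZONEMASK;

            /* _alloc_pages() would pass contig_page_data.node_zonelists + index
             * to __alloc_pages() as the fallback list to walk. */
            printf("zonelist index = %u\n", index);
            return 0;
    }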
F.1.3 Function: __alloc_pages() (mm/page_alloc.c)

At this stage, we've reached what is described as the heart of the zoned buddy
allocator, the __alloc_pages() function. It is responsible for cycling through the
fallback zones and selecting one suitable for the allocation. If memory is tight, it
will take some steps to address the problem. It will wake kswapd and, if necessary,
it will do the work of kswapd manually.

327 struct page * __alloc_pages(unsigned int gfp_mask,
                                unsigned int order,
                                zonelist_t *zonelist)
328 {
329         unsigned long min;
330         zone_t **zone, * classzone;
331         struct page * page;
332         int freed;
333
334         zone = zonelist->zones;
335         classzone = *zone;
336         if (classzone == NULL)
337                 return NULL;
338         min = 1UL << order;
339         for (;;) {
340                 zone_t *z = *(zone++);
341                 if (!z)
342                         break;
343
344                 min += z->pages_low;
345                 if (z->free_pages > min) {
346                         page = rmqueue(z, order);
347                         if (page)
348                                 return page;
349                 }
350         }

334 Set zone to be the preferred zone to allocate from

335 The preferred zone is recorded as the classzone. If one of the pages_low
    watermarks is reached later, the classzone is marked as needing balance

336-337 An unnecessary sanity check. build_zonelists() would need to be seriously
    broken for this to happen

338-350 This style of block appears a number of times in this function. It reads as
    cycle through all zones in this fallback list and see whether the allocation can
    be satisfied without violating watermarks. Note that the pages_low for each
    fallback zone is added together. This is deliberate to reduce the probability a
    fallback zone will be used (a small worked example follows this commentary)

340 z is the zone currently being examined. The zone variable is moved to the next
    fallback zone

341-342 If this is the last zone in the fallback list, break

344 Increment the number of pages to be allocated by the watermark for easy
    comparisons. This happens for each zone in the fallback zones. While this at
    first appears to be a bug, this behavior is actually intended to reduce the
    probability a fallback zone is used

345-349 Allocate the page block if it can be assigned without reaching the pages_low
    watermark. rmqueue() (See Section F.1.4) is responsible for removing the block
    of pages from the zone

347-348 If the pages could be allocated, return a pointer to them
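The cumulative effect of adding the pages_low watermarks together can be seen in
the following user-space sketch. The zone values are invented purely for
illustration.

    #include <stdio.h>

    int main(void)
    {
            /* hypothetical fallback order with invented watermarks and free counts */
            unsigned long pages_low[]  = { 255, 255, 128 };
            unsigned long free_pages[] = { 200, 900, 300 };
            unsigned long min = 1UL << 0;           /* order-0 allocation */
            int i;

            for (i = 0; i < 3; i++) {
                    min += pages_low[i];            /* watermarks accumulate */
                    if (free_pages[i] > min) {
                            printf("zone %d can satisfy the request (min=%lu)\n",
                                   i, min);
                            return 0;
                    }
            }
            printf("no zone passed the watermark check\n");
            return 0;
    }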
352         classzone->need_balance = 1;
353         mb();
354         if (waitqueue_active(&kswapd_wait))
355                 wake_up_interruptible(&kswapd_wait);
356
357         zone = zonelist->zones;
358         min = 1UL << order;
359         for (;;) {
360                 unsigned long local_min;
361                 zone_t *z = *(zone++);
362                 if (!z)
363                         break;
364
365                 local_min = z->pages_min;
366                 if (!(gfp_mask & __GFP_WAIT))
367                         local_min >>= 2;
368                 min += local_min;
369                 if (z->free_pages > min) {
370                         page = rmqueue(z, order);
371                         if (page)
372                                 return page;
373                 }
374         }

352 Mark the preferred zone as needing balance. This flag will be read later by
    kswapd

353 This is a memory barrier. It ensures that all CPUs will see any changes made to
    variables before this line of code. This is important because kswapd could be
    running on a different processor to the memory allocator

354-355 Wake up kswapd if it is asleep

357-358 Begin again with the first preferred zone and min value

360-374 Cycle through all the zones. This time, allocate the pages if they can be
    allocated without hitting the pages_min watermark

365 local_min is how low a number of free pages this zone can have

366-367 If the process can not wait or reschedule (__GFP_WAIT is clear), then allow
    the zone to be put under further memory pressure than the watermark normally
    allows

376         /* here we're in the low on memory slow path */
377
378 rebalance:
379         if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
380                 zone = zonelist->zones;
381                 for (;;) {
382                         zone_t *z = *(zone++);
383                         if (!z)
384                                 break;
385
386                         page = rmqueue(z, order);
387                         if (page)
388                                 return page;
389                 }
390                 return NULL;
391         }

378 This label is returned to after an attempt is made to synchronously free pages.
    From this line on, the low on memory path has been reached. It is likely the
    process will sleep

379-391 These two flags are only set by the OOM killer. As the process is trying to
    kill itself cleanly, allocate the pages if at all possible as it is known they
    will be freed very soon

393         /* Atomic allocations - we can't balance anything */
394         if (!(gfp_mask & __GFP_WAIT))
395                 return NULL;
396
397         page = balance_classzone(classzone, gfp_mask, order, &freed);
398         if (page)
399                 return page;
400
401         zone = zonelist->zones;
402         min = 1UL << order;
403         for (;;) {
404                 zone_t *z = *(zone++);
405                 if (!z)
406                         break;
407
408                 min += z->pages_min;
409                 if (z->free_pages > min) {
410                         page = rmqueue(z, order);
411                         if (page)
412                                 return page;
413                 }
414         }
415
416         /* Don't let big-order allocations loop */
417         if (order > 3)
418                 return NULL;
419
420         /* Yield for kswapd, and try again */
421         yield();
422         goto rebalance;
423 }

394-395 If the calling process can not sleep, return NULL as the only way to allocate
    the pages from here involves sleeping

397 balance_classzone() (See Section F.1.6) performs the work of kswapd in a
    synchronous fashion. The principal difference is that instead of freeing the
    memory into a global pool, it is kept for the process using the
    current→local_pages linked list

398-399 If a page block of the right order has been freed, return it. Just because
    this is NULL does not mean an allocation will fail as it could be a higher order
    of pages that was released

403-414 This is identical to the block above. Allocate the page blocks if it can be
    done without hitting the pages_min watermark

417-418 Satisfying a large allocation like 2^4 number of pages is difficult. If it
    has not been satisfied by now, it is better to simply return NULL

421 Yield the processor to give kswapd a chance to work

422 Attempt to balance the zones again and allocate
F.1.4 Function: rmqueue() (mm/page_alloc.c)

This function is called from __alloc_pages(). It is responsible for finding a block
of memory large enough to be used for the allocation. If a block of memory of the
requested size is not available, it will look for a larger order that may be split
into two buddies. The actual splitting is performed by the expand() (See Section
F.1.5) function.

198 static FASTCALL(struct page *rmqueue(zone_t *zone,
                                         unsigned int order));
199 static struct page * rmqueue(zone_t *zone, unsigned int order)
200 {
201         free_area_t * area = zone->free_area + order;
202         unsigned int curr_order = order;
203         struct list_head *head, *curr;
204         unsigned long flags;
205         struct page *page;
206
207         spin_lock_irqsave(&zone->lock, flags);
208         do {
209                 head = &area->free_list;
210                 curr = head->next;
211
212                 if (curr != head) {
213                         unsigned int index;
214
215                         page = list_entry(curr, struct page, list);
216                         if (BAD_RANGE(zone,page))
217                                 BUG();
218                         list_del(curr);
219                         index = page - zone->zone_mem_map;
220                         if (curr_order != MAX_ORDER-1)
221                                 MARK_USED(index, curr_order, area);
222                         zone->free_pages -= 1UL << order;
223
224                         page = expand(zone, page, index, order,
                                          curr_order, area);
225                         spin_unlock_irqrestore(&zone->lock, flags);
226
227                         set_page_count(page, 1);
228                         if (BAD_RANGE(zone,page))
229                                 BUG();
230                         if (PageLRU(page))
231                                 BUG();
232                         if (PageActive(page))
233                                 BUG();
234                         return page;
235                 }
236                 curr_order++;
237                 area++;
238         } while (curr_order < MAX_ORDER);
239         spin_unlock_irqrestore(&zone->lock, flags);
240
241         return NULL;
242 }

199 The parameters are the zone to allocate from and what order of pages are
    required

201 Because the free_area is an array of linked lists, the order may be used as an
    index within the array

207 Acquire the zone lock

208-238 This while block is responsible for finding what order of pages we will need
    to allocate from. If there isn't a free block at the order we are interested in,
    check the higher blocks until a suitable one is found

209 head is the list of free page blocks for this order

210 curr is the first block of pages

212-235 If there is a free page block at this order, then allocate it

215 page is set to be a pointer to the first page in the free block

216-217 Sanity check to make sure this page belongs to this zone and is within the
    zone_mem_map. It is unclear how this could possibly happen without severe bugs
    in the allocator itself that would place blocks in the wrong zones

218 As the block is going to be allocated, remove it from the free list

219 index treats the zone_mem_map as an array of pages so that index will be the
    offset within the array

220-221 Toggle the bit that represents this pair of buddies. MARK_USED() is a macro
    which calculates which bit to toggle (a sketch of the macro follows this
    commentary)

222 Update the statistics for this zone. 1UL<<order is the number of pages being
    allocated

224 expand() (See Section F.1.5) is the function responsible for splitting page
    blocks of higher orders

225 No other updates to the zone need to take place so release the lock

227 Show that the page is in use

228-233 Sanity checks

234 Page block has been successfully allocated so return it

236-237 If a page block was not free of the correct order, move to a higher order of
    page blocks and see what can be found there

239 No other updates to the zone need to take place so release the lock

241 No page blocks of the requested or higher order are available so return failure
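For reference, MARK_USED() is a short macro in mm/page_alloc.c. The following is a
sketch of its definition as I read it in the 2.4 source; consult mm/page_alloc.c for
the authoritative version.

    /* Sketch of MARK_USED(): toggle the bit in the free_area bitmap that
     * represents the buddy pair containing page `index' at order `order'. */
    #define MARK_USED(index, order, area) \
            __change_bit((index) >> (1 + (order)), (area)->map)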
F.1.5 Function: expand() (mm/page_alloc.c)

This function splits page blocks of higher orders until a page block of the needed
order is available.

177 static inline struct page * expand (zone_t *zone,
                                        struct page *page,
                                        unsigned long index,
                                        int low,
                                        int high,
                                        free_area_t * area)
179 {
180         unsigned long size = 1 << high;
181
182         while (high > low) {
183                 if (BAD_RANGE(zone,page))
184                         BUG();
185                 area--;
186                 high--;
187                 size >>= 1;
188                 list_add(&(page)->list, &(area)->free_list);
189                 MARK_USED(index, high, area);
190                 index += size;
191                 page += size;
192         }
193         if (BAD_RANGE(zone,page))
194                 BUG();
195         return page;
196 }

177 The parameters are:

    zone is where the allocation is coming from

    page is the first page of the block being split

    index is the index of page within mem_map

    low is the order of pages needed for the allocation

    high is the order of pages that is being split for the allocation

    area is the free_area_t representing the high-order block of pages

180 size is the number of pages in the block that is to be split

182-192 Keep splitting until a block of the needed page order is found

183-184 Sanity check to make sure this page belongs to this zone and is within the
    zone_mem_map

185 area is now the next free_area_t representing the lower order of page blocks

186 high is the next order of page blocks to be split

187 The size of the block being split is now half as big

188 Of the pair of buddies, the one lower in the mem_map is added to the free list
    for the lower order

189 Toggle the bit representing the pair of buddies

190 index is now the index of the second buddy of the newly created pair

191 page now points to the second buddy of the newly created pair

193-194 Sanity check

195 The blocks have been successfully split so return the page
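The splitting can be followed with a small user-space sketch that mirrors the loop
in expand(). The indices are invented; the point is how high, size and index evolve
as an order-3 block is split down to the order-0 block that is finally returned.

    #include <stdio.h>

    int main(void)
    {
            int low = 0, high = 3;                  /* want order 0, found order 3 */
            unsigned long size = 1UL << high;
            unsigned long index = 0;                /* index of the block being split */

            while (high > low) {
                    high--;
                    size >>= 1;
                    printf("free the order-%d buddy at index %lu\n", high, index);
                    index += size;                  /* keep the second buddy of the pair */
            }
            printf("allocate the order-%d block at index %lu\n", low, index);
            return 0;
    }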
F.1.6 Function: balance_classzone() (mm/page_alloc.c)

This function is part of the direct-reclaim path. Allocators which can sleep will
call this function to start performing the work of kswapd in a synchronous fashion.
As the process is performing the work itself, the pages it frees of the desired
order are reserved in a linked list in current→local_pages and the number of page
blocks in the list is stored in current→nr_local_pages. Note that page blocks is
not the same as number of pages. A page block could be of any order.

253 static struct page * balance_classzone(zone_t * classzone,
                                           unsigned int gfp_mask,
                                           unsigned int order,
                                           int * freed)
254 {
255         struct page * page = NULL;
256         int __freed = 0;
257
258         if (!(gfp_mask & __GFP_WAIT))
259                 goto out;
260         if (in_interrupt())
261                 BUG();
262
263         current->allocation_order = order;
264         current->flags |= PF_MEMALLOC | PF_FREE_PAGES;
265
266         __freed = try_to_free_pages_zone(classzone, gfp_mask);
267
268         current->flags &= ~(PF_MEMALLOC | PF_FREE_PAGES);
269

258-259 If the caller is not allowed to sleep, then goto out to exit the function.
    For this to occur, the function would have to be called directly or
    __alloc_pages() would need to be deliberately broken

260-261 This function may not be used by interrupts. Again, deliberate damage would
    have to be introduced for this condition to occur

263 Record the desired size of the allocation in current→allocation_order. This is
    actually unused although it could have been used to only add pages of the
    desired order to the local_pages list. As it is, the order of pages in the list
    is stored in page→index

264 Set the flags which will tell the free functions to add the pages to the
    local_list

266 Free pages directly from the desired zone with try_to_free_pages_zone() (See
    Section J.5.3). This is where the direct-reclaim path intersects with kswapd

268 Clear the flags again so that the free functions do not continue to add pages to
    the local_pages list

270         if (current->nr_local_pages) {
271                 struct list_head * entry, * local_pages;
272                 struct page * tmp;
273                 int nr_pages;
274
275                 local_pages = &current->local_pages;
276
277                 if (likely(__freed)) {
278                         /* pick from the last inserted so we're lifo */
279                         entry = local_pages->next;
280                         do {
281                                 tmp = list_entry(entry, struct page, list);
282                                 if (tmp->index == order &&
                                        memclass(page_zone(tmp), classzone)) {
283                                         list_del(entry);
284                                         current->nr_local_pages--;
285                                         set_page_count(tmp, 1);
286                                         page = tmp;
287
288                                         if (page->buffers)
289                                                 BUG();
290                                         if (page->mapping)
291                                                 BUG();
292                                         if (!VALID_PAGE(page))
293                                                 BUG();
294                                         if (PageLocked(page))
295                                                 BUG();
296                                         if (PageLRU(page))
297                                                 BUG();
298                                         if (PageActive(page))
299                                                 BUG();
300                                         if (PageDirty(page))
301                                                 BUG();
302
303                                         break;
304                                 }
305                         } while ((entry = entry->next) != local_pages);
306                 }

Presuming that pages exist in the local_pages list, this function will cycle
through the list looking for a page block belonging to the desired zone and order.

270 Only enter this block if pages are stored in the local list

275 Start at the beginning of the list

277 If pages were freed with try_to_free_pages_zone() then...

279 The last one inserted is chosen first as it is likely to be cache hot and it is
    desirable to use pages that have been recently referenced

280-305 Cycle through the pages in the list until we find one of the desired order
    and zone

281 Get the page from this list entry

282 The order of the page block is stored in page→index so check if the order
    matches the desired order and that it belongs to the right zone. It is unlikely
    that pages from another zone are on this list but it could occur if swap_out()
    is called to free pages directly from process page tables

283 This is a page of the right order and zone so remove it from the list

284 Decrement the number of page blocks in the list

285 Set the page count to 1 as it is about to be used

286 Set page as it will be returned. tmp is needed for the next block for freeing
    the remaining pages in the local list

288-301 Perform the same checks that are performed in __free_pages_ok() to ensure it
    is safe to free this page

305 Move to the next page in the list if the current one was not of the desired
    order and zone

308                 nr_pages = current->nr_local_pages;
309                 /* free in reverse order so that the global
                     * order will be lifo */
310                 while ((entry = local_pages->prev) != local_pages) {
311                         list_del(entry);
312                         tmp = list_entry(entry, struct page, list);
313                         __free_pages_ok(tmp, tmp->index);
314                         if (!nr_pages--)
315                                 BUG();
316                 }
317                 current->nr_local_pages = 0;
318         }
319  out:
320         *freed = __freed;
321         return page;
322 }

This block frees the remaining pages in the list.

308 Get the number of page blocks that are to be freed

310 Loop until the local_pages list is empty

311 Remove this page block from the list

312 Get the struct page for the entry

313 Free the page with __free_pages_ok() (See Section F.3.2)

314-315 If the count of page blocks reaches zero and there are still pages in the
    list, it means that the accounting is seriously broken somewhere or that someone
    added pages to the local_pages list manually so call BUG()

317 Set the number of page blocks to 0 as they have all been freed

320 Update the freed parameter to tell the caller how many pages were freed in
    total

321 Return the page block of the requested order and zone. If the freeing failed,
    this will be returning NULL
F.2 Allocation Helper Functions

Contents
F.2 Allocation Helper Functions                         418
  F.2.1 Function: alloc_page()                          418
  F.2.2 Function: __get_free_page()                     418
  F.2.3 Function: __get_free_pages()                    418
  F.2.4 Function: __get_dma_pages()                     419
  F.2.5 Function: get_zeroed_page()                     419

This section will cover miscellaneous helper functions and macros the Buddy
Allocator uses to allocate pages. Very few of them do real work and are available
just for the convenience of the programmer.

F.2.1 Function: alloc_page() (include/linux/mm.h)

This trivial macro just calls alloc_pages() with an order of 0 to return 1 page.
It is declared as follows:
449 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
F.2.2 Function: __get_free_page() (include/linux/mm.h)

This trivial function calls __get_free_pages() with an order of 0 to return 1 page.
It is declared as follows:

454 #define __get_free_page(gfp_mask) \
455                 __get_free_pages((gfp_mask),0)
F.2.3 Function: __get_free_pages() (mm/page_alloc.c)

This function is for callers who do not want to worry about pages and only want to
get back an address they can use. It is declared as follows:

428 unsigned long __get_free_pages(unsigned int gfp_mask,
                                   unsigned int order)
429 {
430         struct page * page;
431
432         page = alloc_pages(gfp_mask, order);
433         if (!page)
434                 return 0;
435         return (unsigned long) page_address(page);
436 }

432 alloc_pages() (See Section F.1.1) does the work of allocating the page block

433-434 Make sure the page is valid

435 page_address() returns the virtual address of the page
F.2.4 Function: __get_dma_pages() (include/linux/mm.h)

This is of principal interest to device drivers. It will return memory from
ZONE_DMA suitable for use with DMA devices. It is declared as follows:

457 #define __get_dma_pages(gfp_mask, order) \
458                 __get_free_pages((gfp_mask) | GFP_DMA,(order))

458 The gfp_mask is or-ed with GFP_DMA to tell the allocator to allocate from
    ZONE_DMA

F.2.5 Function: get_zeroed_page() (mm/page_alloc.c)

This function will allocate one page and then zero out the contents of it. It is
declared as follows:

438 unsigned long get_zeroed_page(unsigned int gfp_mask)
439 {
440         struct page * page;
441
442         page = alloc_pages(gfp_mask, 0);
443         if (page) {
444                 void *address = page_address(page);
445                 clear_page(address);
446                 return (unsigned long) address;
447         }
448         return 0;
449 }

438 gfp_mask are the flags which affect allocator behaviour

442 alloc_pages() (See Section F.1.1) does the work of allocating the page block

444 page_address() returns the virtual address of the page

445 clear_page() will fill the contents of a page with zero

446 Return the address of the zeroed page
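A minimal, hypothetical example of the address-based helpers is shown below. The
wrapper function is invented for illustration and error handling is kept to the
bare minimum; free_page() is covered in Section F.4.3.

    /* Hypothetical example: allocate one zero-filled page, use it through
     * its virtual address and release it again. */
    static int example_zeroed_page(void)
    {
            unsigned long addr;

            addr = get_zeroed_page(GFP_KERNEL);
            if (!addr)
                    return -ENOMEM;

            /* ... the page at addr is guaranteed to contain zeros ... */

            free_page(addr);
            return 0;
    }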
F.3 Free Pages

Contents
F.3 Free Pages                                          420
  F.3.1 Function: __free_pages()                        420
  F.3.2 Function: __free_pages_ok()                     420

F.3.1 Function: __free_pages() (mm/page_alloc.c)

The call graph for this function is shown in Figure 6.4. Just to be confusing, the
opposite to alloc_pages() is not free_pages(), it is __free_pages(). free_pages()
is a helper function which takes an address as a parameter; it will be discussed in
a later section.

451 void __free_pages(struct page *page, unsigned int order)
452 {
453         if (!PageReserved(page) && put_page_testzero(page))
454                 __free_pages_ok(page, order);
455 }

451 The parameters are the page we wish to free and what order block it is

453 Sanity checks. PageReserved() indicates that the page is reserved by the boot
    memory allocator. put_page_testzero() is just a macro wrapper around
    atomic_dec_and_test(); it decrements the usage count and makes sure it is zero

454 Call the function that does all the hard work
F.3.2 Function: __free_pages_ok() (mm/page_alloc.c)

This function will do the actual freeing of the page and coalesce the buddies if
possible.

81 static void FASTCALL(__free_pages_ok (struct page *page,
                                         unsigned int order));
82 static void __free_pages_ok (struct page *page, unsigned int order)
83 {
84          unsigned long index, page_idx, mask, flags;
85          free_area_t *area;
86          struct page *base;
87          zone_t *zone;
88
93          if (PageLRU(page)) {
94                  if (unlikely(in_interrupt()))
95                          BUG();
96                  lru_cache_del(page);
97          }
98
99          if (page->buffers)
100                 BUG();
101         if (page->mapping)
102                 BUG();
103         if (!VALID_PAGE(page))
104                 BUG();
105         if (PageLocked(page))
106                 BUG();
107         if (PageActive(page))
108                 BUG();
109         page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));

82 The parameters are the beginning of the page block to free and what order number
    of pages are to be freed

93-97 A dirty page on the LRU will still have the LRU bit set when pinned for IO. On
    IO completion, it is freed so it must now be removed from the LRU list

99-108 Sanity checks

109 The flags showing a page has been referenced and is dirty have to be cleared
    because the page is now free and not in use
110
111         if (current->flags & PF_FREE_PAGES)
112                 goto local_freelist;
113  back_local_freelist:
114
115         zone = page_zone(page);
116
117         mask = (~0UL) << order;
118         base = zone->zone_mem_map;
119         page_idx = page - base;
120         if (page_idx & ~mask)
121                 BUG();
122         index = page_idx >> (1 + order);
123
124         area = zone->free_area + order;
125

111-112 If this flag is set, the pages freed are to be kept for the process doing the
    freeing. This is set by balance_classzone() (See Section F.1.6) during page
    allocation if the caller is freeing the pages itself rather than waiting for
    kswapd to do the work

115 The zone the page belongs to is encoded within the page flags. The page_zone()
    macro returns the zone

117 The calculation of mask is discussed in the companion document. It is basically
    related to the address calculation of the buddy

118 base is the beginning of this zone_mem_map. For the buddy calculation to work,
    it has to be relative to address 0 so that the addresses will be a power of two

119 page_idx treats the zone_mem_map as an array of pages. This is the index of the
    page within the map

120-121 If the index is not the proper power of two, things are severely broken and
    calculation of the buddy will not work

122 This index is the bit index within free_area→map

124 area is the area storing the free lists and map for the order block the pages
    are being freed from
126         spin_lock_irqsave(&zone->lock, flags);
127
128         zone->free_pages -= mask;
129
130         while (mask + (1 << (MAX_ORDER-1))) {
131                 struct page *buddy1, *buddy2;
132
133                 if (area >= zone->free_area + MAX_ORDER)
134                         BUG();
135                 if (!__test_and_change_bit(index, area->map))
136                         /*
137                          * the buddy page is still allocated.
138                          */
139                         break;
140                 /*
141                  * Move the buddy up one level.
142                  * This code is taking advantage of the identity:
143                  *      -mask = 1+~mask
144                  */
145                 buddy1 = base + (page_idx ^ -mask);
146                 buddy2 = base + page_idx;
147                 if (BAD_RANGE(zone,buddy1))
148                         BUG();
149                 if (BAD_RANGE(zone,buddy2))
150                         BUG();
151
152                 list_del(&buddy1->list);
153                 mask <<= 1;
154                 area++;
155                 index >>= 1;
156                 page_idx &= mask;
157         }

126 The zone is about to be altered so take out the lock. The lock is an
    interrupt-safe lock as it is possible for interrupt handlers to allocate a page
    in this path

128 Another side effect of the calculation of mask is that -mask is the number of
    pages that are to be freed

130-157 The allocator will keep trying to coalesce blocks together until it either
    cannot merge or reaches the highest order that can be merged. mask will be
    adjusted for each order block that is merged. When the highest order that can
    be merged is reached, this while loop will evaluate to 0 and exit

133-134 If by some miracle, mask is corrupt, this check will make sure the free_area
    array will not be read beyond the end

135 Toggle the bit representing this pair of buddies. If the bit was previously
    zero, both buddies were in use. As this buddy is being freed, one is still in
    use and cannot be merged

145-146 The calculation of the two addresses is discussed in Chapter 6 (a small
    sketch of the arithmetic follows this commentary)

147-150 Sanity check to make sure the pages are within the correct zone_mem_map and
    actually belong to this zone

152 The buddy has been freed so remove it from any list it was part of

153-156 Prepare to examine the higher order buddy for merging

153 Move the mask one bit to the left for order 2^(k+1)

154 area is a pointer within an array so area++ moves to the next index

155 The index in the bitmap of the higher order

156 The page index within the zone_mem_map for the buddy to merge
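The identity -mask = 1+~mask and the buddy calculation can be checked with a short
user-space sketch; the index chosen is arbitrary but aligned to the order.

    #include <stdio.h>

    int main(void)
    {
            unsigned int order = 2;
            unsigned long mask = (~0UL) << order;   /* as computed at line 117 */
            unsigned long page_idx = 8;             /* example index, order-aligned */

            /* -mask == 1UL << order, so the buddy is found by toggling bit `order' */
            unsigned long buddy_idx = page_idx ^ -mask;

            printf("-mask = %lu, buddy of %lu at order %u is %lu\n",
                   -mask, page_idx, order, buddy_idx);
            return 0;
    }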
158         list_add(&(base + page_idx)->list, &area->free_list);
159
160         spin_unlock_irqrestore(&zone->lock, flags);
161         return;
162
163  local_freelist:
164         if (current->nr_local_pages)
165                 goto back_local_freelist;
166         if (in_interrupt())
167                 goto back_local_freelist;
168
169         list_add(&page->list, &current->local_pages);
170         page->index = order;
171         current->nr_local_pages++;
172 }

158 As much merging as possible has been completed and a new page block is free so
    add it to the free_list for this order

160-161 Changes to the zone are complete so free the lock and return

163 This is the code path taken when the pages are not freed to the main pool but
    instead are reserved for the process doing the freeing

164-165 If the process already has reserved pages, it is not allowed to reserve any
    more so return back. This is unusual as balance_classzone() assumes that more
    than one page block may be returned on this list. It is likely to be an
    oversight but may still work if the first page block freed is the same order and
    zone as required by balance_classzone()

166-167 An interrupt does not have process context so it has to free in the normal
    fashion. It is unclear how an interrupt could end up here at all. This check is
    likely to be bogus and impossible to be true

169 Add the page block to the list for the process's local_pages

170 Record what order allocation it was for freeing later

171 Increase the use count for nr_local_pages
F.4 Free Helper Functions

Contents
F.4 Free Helper Functions                               425
  F.4.1 Function: free_pages()                          425
  F.4.2 Function: __free_page()                         425
  F.4.3 Function: free_page()                           425

These functions are very similar to the page allocation helper functions in that
they do no real work themselves and depend on the __free_pages() function to
perform the actual free.

F.4.1 Function: free_pages() (mm/page_alloc.c)

This function takes an address instead of a page as a parameter to free. It is
declared as follows:

457 void free_pages(unsigned long addr, unsigned int order)
458 {
459         if (addr != 0)
460                 __free_pages(virt_to_page(addr), order);
461 }

460 The function __free_pages() is discussed in Section F.3.1. The macro
    virt_to_page() returns the struct page for the addr

F.4.2 Function: __free_page() (include/linux/mm.h)

This trivial macro just calls the function __free_pages() (See Section F.3.1) with
an order of 0 for 1 page. It is declared as follows:

472 #define __free_page(page) __free_pages((page), 0)

F.4.3 Function: free_page() (include/linux/mm.h)

This trivial macro just calls the function free_pages(). The essential difference
between this macro and __free_page() is that this function takes a virtual address
as a parameter and __free_page() takes a struct page.
472 #define free_page(addr) free_pages((addr),0)
Appendix G

Non-Contiguous Memory Allocation

Contents
G.1 Allocating A Non-Contiguous Area . . . . . . . . . . . . . . . . . . 427
  G.1.1 Function: vmalloc() . . . . . . . . . . . . . . . . . . . . . . 427
  G.1.2 Function: __vmalloc() . . . . . . . . . . . . . . . . . . . . . 427
  G.1.3 Function: get_vm_area() . . . . . . . . . . . . . . . . . . . . 428
  G.1.4 Function: vmalloc_area_pages() . . . . . . . . . . . . . . . . 430
  G.1.5 Function: __vmalloc_area_pages() . . . . . . . . . . . . . . . 431
  G.1.6 Function: alloc_area_pmd() . . . . . . . . . . . . . . . . . . 432
  G.1.7 Function: alloc_area_pte() . . . . . . . . . . . . . . . . . . 434
  G.1.8 Function: vmap() . . . . . . . . . . . . . . . . . . . . . . . . 435
G.2 Freeing A Non-Contiguous Area . . . . . . . . . . . . . . . . . . . . 437
  G.2.1 Function: vfree() . . . . . . . . . . . . . . . . . . . . . . . 437
  G.2.2 Function: vmfree_area_pages() . . . . . . . . . . . . . . . . . 438
  G.2.3 Function: free_area_pmd() . . . . . . . . . . . . . . . . . . . 439
  G.2.4 Function: free_area_pte() . . . . . . . . . . . . . . . . . . . 440
G.1 Allocating A Non-Contiguous Area

Contents
G.1 Allocating A Non-Contiguous Area                    427
  G.1.1 Function: vmalloc()                             427
  G.1.2 Function: __vmalloc()                           427
  G.1.3 Function: get_vm_area()                         428
  G.1.4 Function: vmalloc_area_pages()                  430
  G.1.5 Function: __vmalloc_area_pages()                431
  G.1.6 Function: alloc_area_pmd()                      432
  G.1.7 Function: alloc_area_pte()                      434
  G.1.8 Function: vmap()                                435

G.1.1 Function: vmalloc() (include/linux/vmalloc.h)

The call graph for this function is shown in Figure 7.2. The following macros differ
only by their GFP_ flags (See Section 6.4). The size parameter is page aligned by
__vmalloc() (See Section G.1.2).

37 static inline void * vmalloc (unsigned long size)
38 {
39         return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
40 }

45 static inline void * vmalloc_dma (unsigned long size)
46 {
47         return __vmalloc(size, GFP_KERNEL|GFP_DMA, PAGE_KERNEL);
48 }

54 static inline void * vmalloc_32(unsigned long size)
55 {
56         return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
57 }

37 The flags indicate to use either ZONE_NORMAL or ZONE_HIGHMEM as necessary

46 The flag indicates to only allocate from ZONE_DMA

55 Only physical pages from ZONE_NORMAL will be allocated
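A minimal, hypothetical use of vmalloc() is shown below. The buffer size and the
wrapper function are invented for illustration; vfree() is covered in Section
G.2.1.

    /* Hypothetical example: allocate a large, virtually contiguous buffer.
     * The memory need not be physically contiguous, which is why vmalloc()
     * is preferred over alloc_pages() for large requests. */
    static int example_vmalloc(void)
    {
            void *buf;

            buf = vmalloc(128 * 1024);      /* 128KiB, page aligned internally */
            if (!buf)
                    return -ENOMEM;

            /* ... use buf ... */

            vfree(buf);
            return 0;
    }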
G.1.2 Function: __vmalloc() (mm/vmalloc.c)

This function has three tasks. It page aligns the size request, asks get_vm_area()
to find an area for the request and uses vmalloc_area_pages() to allocate the PTEs
for the pages.

261 void * __vmalloc (unsigned long size, int gfp_mask, pgprot_t prot)
262 {
263         void * addr;
264         struct vm_struct *area;
265
266         size = PAGE_ALIGN(size);
267         if (!size || (size >> PAGE_SHIFT) > num_physpages)
268                 return NULL;
269         area = get_vm_area(size, VM_ALLOC);
270         if (!area)
271                 return NULL;
272         addr = area->addr;
273         if (__vmalloc_area_pages(VMALLOC_VMADDR(addr), size, gfp_mask,
274                                  prot, NULL)) {
275                 vfree(addr);
276                 return NULL;
277         }
278         return addr;
279 }

261 The parameters are the size to allocate, the GFP_ flags to use for allocation
    and what protection to give the PTE

266 Align the size to a page size

267 Sanity check. Make sure the size is not 0 and that the size requested is not
    larger than the number of physical pages in the system

269 Find an area of virtual address space to store the allocation with
    get_vm_area() (See Section G.1.3)

272 The addr field has been filled by get_vm_area()

273 Allocate the PTE entries needed for the allocation with
    __vmalloc_area_pages() (See Section G.1.5). If it fails, a non-zero value
    -ENOMEM is returned

275-276 If the allocation fails, free any PTEs, pages and descriptions of the area

278 Return the address of the allocated area
G.1.3 Function: get_vm_area() (mm/vmalloc.c)

To allocate an area for the vm_struct, the slab allocator is asked to provide the
necessary memory via kmalloc(). It then searches the vm_struct list linearly
looking for a region large enough to satisfy a request, including a page pad at the
end of the area.

195 struct vm_struct * get_vm_area(unsigned long size,
                                   unsigned long flags)
196 {
197         unsigned long addr, next;
198         struct vm_struct **p, *tmp, *area;
199
200         area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
201         if (!area)
202                 return NULL;
203
204         size += PAGE_SIZE;
205         if(!size) {
206                 kfree (area);
207                 return NULL;
208         }
209
210         addr = VMALLOC_START;
211         write_lock(&vmlist_lock);
212         for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
213                 if ((size + addr) < addr)
214                         goto out;
215                 if (size + addr <= (unsigned long) tmp->addr)
216                         break;
217                 next = tmp->size + (unsigned long) tmp->addr;
218                 if (next > addr)
219                         addr = next;
220                 if (addr > VMALLOC_END-size)
221                         goto out;
222         }
223         area->flags = flags;
224         area->addr = (void *)addr;
225         area->size = size;
226         area->next = *p;
227         *p = area;
228         write_unlock(&vmlist_lock);
229         return area;
230
231 out:
232         write_unlock(&vmlist_lock);
233         kfree(area);
234         return NULL;
235 }

195 The parameters are the size of the requested region, which should be a multiple
    of the page size, and the area flags, either VM_ALLOC or VM_IOREMAP

200-202 Allocate space for the vm_struct description struct

204 Pad the request so there is a page gap between areas. This is to guard against
    overwrites

205-206 This is to ensure the size is not 0 after the padding due to an overflow. If
    something does go wrong, free the area just allocated and return NULL

210 Start the search at the beginning of the vmalloc address space

211 Lock the list

212-222 Walk through the list searching for an area large enough for the request

213-214 Check to make sure the end of the addressable range has not been reached

215-216 If the requested area would fit between the current address and the next
    area, the search is complete

217 Make sure the address would not go over the end of the vmalloc address space

223-225 Copy in the area information

226-227 Link the new area into the list

228-229 Unlock the list and return

231 This label is reached if the request could not be satisfied

232 Unlock the list

233-234 Free the memory used for the area descriptor and return
G.1.4 Function: vmalloc_area_pages() (mm/vmalloc.c)

This is just a wrapper around __vmalloc_area_pages(). This function exists for
compatibility with older kernels. The name change was made to reflect that the new
function __vmalloc_area_pages() is able to take an array of pages to use for
insertion into the pagetables.

189 int vmalloc_area_pages(unsigned long address, unsigned long size,
190                        int gfp_mask, pgprot_t prot)
191 {
192         return __vmalloc_area_pages(address, size, gfp_mask, prot, NULL);
193 }

192 Call __vmalloc_area_pages() with the same parameters. The pages array is
    passed as NULL as the pages will be allocated as necessary
G.1.5 Function: __vmalloc_area_pages() (mm/vmalloc.c)

This is the beginning of a standard page table walk function. This top level
function will step through all PGDs within an address range. For each PGD, it will
call pmd_alloc() to allocate a PMD directory and call alloc_area_pmd() for the
directory.

155 static inline int __vmalloc_area_pages (unsigned long address,
156                                         unsigned long size,
157                                         int gfp_mask,
158                                         pgprot_t prot,
159                                         struct page ***pages)
160 {
161         pgd_t * dir;
162         unsigned long end = address + size;
163         int ret;
164
165         dir = pgd_offset_k(address);
166         spin_lock(&init_mm.page_table_lock);
167         do {
168                 pmd_t *pmd;
169
170                 pmd = pmd_alloc(&init_mm, dir, address);
171                 ret = -ENOMEM;
172                 if (!pmd)
173                         break;
174
175                 ret = -ENOMEM;
176                 if (alloc_area_pmd(pmd, address, end - address,
                                       gfp_mask, prot, pages))
177                         break;
178
179                 address = (address + PGDIR_SIZE) & PGDIR_MASK;
180                 dir++;
181
182                 ret = 0;
183         } while (address && (address < end));
184         spin_unlock(&init_mm.page_table_lock);
185         flush_cache_all();
186         return ret;
187 }

155 The parameters are:

    address is the starting address to allocate PMDs for

    size is the size of the region

    gfp_mask is the GFP_ flags for alloc_pages() (See Section F.1.1)

    prot is the protection to give the PTE entry

    pages is an array of pages to use for insertion instead of having
        alloc_area_pte() allocate them one at a time. Only the vmap() interface
        passes in an array

162 The end address is the starting address plus the size

165 Get the PGD entry for the starting address

166 Lock the kernel reference page table

167-183 For every PGD within this address range, allocate a PMD directory and call
    alloc_area_pmd() (See Section G.1.6)

170 Allocate a PMD directory

176 Call alloc_area_pmd() (See Section G.1.6) which will allocate a PTE for each
    PTE slot in the PMD

179 address becomes the base address of the next PGD entry

180 Move dir to the next PGD entry

184 Release the lock to the kernel page table

185 flush_cache_all() will flush all CPU caches. This is necessary because the
    kernel page tables have changed

186 Return success
G.1.6 Function: alloc_area_pmd() (mm/vmalloc.c)

This is the second stage of the standard page table walk to allocate PTE entries
for an address range. For every PMD within a given address range on a PGD,
pte_alloc() will create a PTE directory and then alloc_area_pte() will be called
to allocate the physical pages.

132 static inline int alloc_area_pmd(pmd_t * pmd, unsigned long address,
133                         unsigned long size, int gfp_mask,
134                         pgprot_t prot, struct page ***pages)
135 {
136         unsigned long end;
137
138         address &= ~PGDIR_MASK;
139         end = address + size;
140         if (end > PGDIR_SIZE)
141                 end = PGDIR_SIZE;
142         do {
143                 pte_t * pte = pte_alloc(&init_mm, pmd, address);
144                 if (!pte)
145                         return -ENOMEM;
146                 if (alloc_area_pte(pte, address, end - address,
147                                    gfp_mask, prot, pages))
148                         return -ENOMEM;
149                 address = (address + PMD_SIZE) & PMD_MASK;
150                 pmd++;
151         } while (address < end);
152         return 0;
153 }

132 The parameters are:

    pmd is the PMD that needs the allocations

    address is the starting address to start from

    size is the size of the region within the PMD to allocate for

    gfp_mask is the GFP_ flags for alloc_pages() (See Section F.1.1)

    prot is the protection to give the PTE entry

    pages is an optional array of pages to use instead of allocating each page
        individually

138 Align the starting address to the PGD

139-141 Calculate end to be the end of the allocation or the end of the PGD,
    whichever occurs first

142-151 For every PMD within the given address range, allocate a PTE directory and
    call alloc_area_pte() (See Section G.1.7)

143 Allocate the PTE directory

146-147 Call alloc_area_pte() which will allocate the physical pages if an array of
    pages is not already supplied with pages

149 address becomes the base address of the next PMD entry

150 Move pmd to the next PMD entry

152 Return success
G.1.7 Function: alloc_area_pte() (mm/vmalloc.c)

This is the last stage of the page table walk. For every PTE in the given PTE
directory and address range, a page will be allocated and associated with the PTE.

 95 static inline int alloc_area_pte (pte_t * pte, unsigned long address,
 96                         unsigned long size, int gfp_mask,
 97                         pgprot_t prot, struct page ***pages)
 98 {
 99         unsigned long end;
100
101         address &= ~PMD_MASK;
102         end = address + size;
103         if (end > PMD_SIZE)
104                 end = PMD_SIZE;
105         do {
106                 struct page * page;
107
108                 if (!pages) {
109                         spin_unlock(&init_mm.page_table_lock);
110                         page = alloc_page(gfp_mask);
111                         spin_lock(&init_mm.page_table_lock);
112                 } else {
113                         page = (**pages);
114                         (*pages)++;
115
116                         /* Add a reference to the page so we can free later */
117                         if (page)
118                                 atomic_inc(&page->count);
119
120                 }
121                 if (!pte_none(*pte))
122                         printk(KERN_ERR "alloc_area_pte: page already exists\n");
123                 if (!page)
124                         return -ENOMEM;
125                 set_pte(pte, mk_pte(page, prot));
126                 address += PAGE_SIZE;
127                 pte++;
128         } while (address < end);
129         return 0;
130 }

101 Align the address to a PMD directory

103-104 The end address is the end of the request or the end of the directory,
    whichever occurs first

105-128 Loop through every PTE in this page. If a pages array is supplied, use pages
    from it to populate the table, otherwise allocate each one individually

108-111 If an array of pages is not supplied, unlock the kernel reference pagetable,
    allocate a page with alloc_page() and reacquire the spinlock

112-120 Else, take one page from the array and increment its usage count as it is
    about to be inserted into the reference page table

121-122 If the PTE is already in use, it means that the areas in the vmalloc region
    are overlapping somehow

123-124 Return failure if physical pages are not available

125 Set the page with the desired protection bits (prot) into the PTE

126 address becomes the address of the next PTE

127 Move to the next PTE

129 Return success
G.1.8 Function: vmap() (mm/vmalloc.c)

This function allows a caller-supplied array of pages to be inserted into the
vmalloc address space. This is unused in 2.4.22 and I suspect it is an accidental
backport from 2.6.x where it is used by the sound subsystem core.

281 void * vmap(struct page **pages, int count,
282             unsigned long flags, pgprot_t prot)
283 {
284         void * addr;
285         struct vm_struct *area;
286         unsigned long size = count << PAGE_SHIFT;
287
288         if (!size || size > (max_mapnr << PAGE_SHIFT))
289                 return NULL;
290         area = get_vm_area(size, flags);
291         if (!area) {
292                 return NULL;
293         }
294         addr = area->addr;
295         if (__vmalloc_area_pages(VMALLOC_VMADDR(addr), size, 0,
296                                  prot, &pages)) {
297                 vfree(addr);
298                 return NULL;
299         }
300         return addr;
301 }

281 The parameters are:

    pages is the caller-supplied array of pages to insert

    count is the number of pages in the array

    flags is the flags to use for the vm_struct

    prot is the protection bits to set the PTE with

286 Calculate the size in bytes of the region to create based on the size of the
    array

288-289 Make sure the size of the region does not exceed limits

290-293 Use get_vm_area() to find a region large enough for the mapping. If one is
    not found, return NULL

294 Get the virtual address of the area

295 Insert the array into the pagetable with __vmalloc_area_pages() (See Section
    G.1.5)

297 If the insertion fails, free the region and return NULL

300 Return the virtual address of the newly mapped region
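Although vmap() is unused in 2.4.22, the following hypothetical sketch shows the
usage pattern it supports. The my_pages array and nr_pages count are assumptions
that would have to be filled in by the caller beforehand, and the choice of
VM_ALLOC and PAGE_KERNEL is illustrative.

    /* Hypothetical example: map an existing array of struct page pointers
     * into a contiguous range of the vmalloc address space. */
    static void *example_vmap(struct page **my_pages, int nr_pages)
    {
            void *addr;

            addr = vmap(my_pages, nr_pages, VM_ALLOC, PAGE_KERNEL);
            if (!addr)
                    return NULL;

            /* ... the pages can now be accessed through one virtual range.
             * vfree(addr) would presumably release the mapping later,
             * dropping the extra reference taken on each page at insertion
             * time while leaving the caller's pages intact ... */
            return addr;
    }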
G.2 Freeing A Non-Contiguous Area

Contents
G.2 Freeing A Non-Contiguous Area                       437
  G.2.1 Function: vfree()                               437
  G.2.2 Function: vmfree_area_pages()                   438
  G.2.3 Function: free_area_pmd()                       439
  G.2.4 Function: free_area_pte()                       440

G.2.1 Function: vfree() (mm/vmalloc.c)

The call graph for this function is shown in Figure 7.4. This is the top level
function responsible for freeing a non-contiguous area of memory. It performs basic
sanity checks before finding the vm_struct for the requested addr. Once found, it
calls vmfree_area_pages().

237 void vfree(void * addr)
238 {
239         struct vm_struct **p, *tmp;
240
241         if (!addr)
242                 return;
243         if ((PAGE_SIZE-1) & (unsigned long) addr) {
244                 printk(KERN_ERR
                           "Trying to vfree() bad address (%p)\n", addr);
245                 return;
246         }
247         write_lock(&vmlist_lock);
248         for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
249                 if (tmp->addr == addr) {
250                         *p = tmp->next;
251                         vmfree_area_pages(VMALLOC_VMADDR(tmp->addr),
                                              tmp->size);
252                         write_unlock(&vmlist_lock);
253                         kfree(tmp);
254                         return;
255                 }
256         }
257         write_unlock(&vmlist_lock);
258         printk(KERN_ERR
                   "Trying to vfree() nonexistent vm area (%p)\n", addr);
259 }

237 The parameter is the address returned by get_vm_area() (See Section G.1.3) to
    either vmalloc() or ioremap()

241-242 Ignore NULL addresses

243-246 This checks the address is page aligned and is a reasonable quick guess to
    see if the area is valid or not

247 Acquire a write lock to the vmlist

248 Cycle through the vmlist looking for the correct vm_struct for addr

249 If this is the correct address then ...

250 Remove this area from the vmlist linked list

251 Free all pages associated with the address range

252 Release the vmlist lock

253 Free the memory used for the vm_struct and return

257-258 The vm_struct was not found. Release the lock and print a message about the
    failed free
G.2.2 Function: vmfree_area_pages() (mm/vmalloc.c)

This is the first stage of the page table walk to free all pages and PTEs associated
with an address range. It is responsible for stepping through the relevant PGDs and
for flushing the TLB.

80 void vmfree_area_pages(unsigned long address, unsigned long size)
81 {
82          pgd_t * dir;
83          unsigned long end = address + size;
84
85          dir = pgd_offset_k(address);
86          flush_cache_all();
87          do {
88                  free_area_pmd(dir, address, end - address);
89                  address = (address + PGDIR_SIZE) & PGDIR_MASK;
90                  dir++;
91          } while (address && (address < end));
92          flush_tlb_all();
93 }

80 The parameters are the starting address and the size of the region

82 The address space end is the starting address plus its size

85 Get the first PGD for the address range

86 Flush the CPU cache so cache hits will not occur on pages that are to be deleted.
    This is a null operation on many architectures including the x86

87 Call free_area_pmd() (See Section G.2.3) to perform the second stage of the page
    table walk

89 address becomes the starting address of the next PGD

90 Move to the next PGD

92 Flush the TLB as the page tables have now changed
G.2.3 Function: free_area_pmd() (mm/vmalloc.c)

This is the second stage of the page table walk. For every PMD in this directory,
call free_area_pte() to free up the pages and PTEs.

56 static inline void free_area_pmd(pgd_t * dir,
                                    unsigned long address,
                                    unsigned long size)
57 {
58          pmd_t * pmd;
59          unsigned long end;
60
61          if (pgd_none(*dir))
62                  return;
63          if (pgd_bad(*dir)) {
64                  pgd_ERROR(*dir);
65                  pgd_clear(dir);
66                  return;
67          }
68          pmd = pmd_offset(dir, address);
69          address &= ~PGDIR_MASK;
70          end = address + size;
71          if (end > PGDIR_SIZE)
72                  end = PGDIR_SIZE;
73          do {
74                  free_area_pte(pmd, address, end - address);
75                  address = (address + PMD_SIZE) & PMD_MASK;
76                  pmd++;
77          } while (address < end);
78 }

56 The parameters are the PGD being stepped through, the starting address and the
    length of the region

61-62 If there is no PGD, return. This can occur after vfree() (See Section G.2.1)
    is called during a failed allocation

63-67 A PGD can be bad if the entry is not present, it is marked read-only or it is
    marked accessed or dirty

68 Get the first PMD for the address range

69 Make the address PGD aligned

70-72 end is either the end of the space to free or the end of this PGD, whichever
    is first

73-77 For every PMD, call free_area_pte() (See Section G.2.4) to free the PTE
    entries

75 address is the base address of the next PMD

76 Move to the next PMD
G.2.4 Function: free_area_pte() (mm/vmalloc.c)

This is the final stage of the page table walk. For every PTE in the given PMD
within the address range, it will free the PTE and the associated page.

22 static inline void free_area_pte(pmd_t * pmd, unsigned long address,
                                    unsigned long size)
23 {
24          pte_t * pte;
25          unsigned long end;
26
27          if (pmd_none(*pmd))
28                  return;
29          if (pmd_bad(*pmd)) {
30                  pmd_ERROR(*pmd);
31                  pmd_clear(pmd);
32                  return;
33          }
34          pte = pte_offset(pmd, address);
35          address &= ~PMD_MASK;
36          end = address + size;
37          if (end > PMD_SIZE)
38                  end = PMD_SIZE;
39          do {
40                  pte_t page;
41                  page = ptep_get_and_clear(pte);
42                  address += PAGE_SIZE;
43                  pte++;
44                  if (pte_none(page))
45                          continue;
46                  if (pte_present(page)) {
47                          struct page *ptpage = pte_page(page);
48                          if (VALID_PAGE(ptpage) &&
                                (!PageReserved(ptpage)))
49                                  __free_page(ptpage);
50                          continue;
51                  }
52                  printk(KERN_CRIT
                           "Whee.. Swapped out page in kernel page table\n");
53          } while (address < end);
54 }
22 The parameters are the PMD that PTEs are being freed from, the starting address
    and the size of the region to free
27-28 The PMD could be absent if this region is from a failed vmalloc()

29-33 A PMD can be bad if it's not in main memory, it's read only or it's marked
    dirty or accessed

34 pte is the first PTE in the address range

35 Align the address to the PMD

36-38 The end is either the end of the requested region or the end of the PMD,
    whichever occurs first

38-53 Step through all PTEs, perform checks and free the PTE with its associated
    page

41 ptep_get_and_clear() will remove a PTE from a page table and return it to the
    caller

42 address will be the base address of the next PTE

43 Move to the next PTE

44 If there was no PTE, simply continue

46-51 If the page is present, perform basic checks and then free it

47 pte_page() uses the global mem_map to find the struct page for the PTE

48-49 Make sure the page is a valid page and it is not reserved before calling
    __free_page() to free the physical page

50 Continue to the next PTE

52 If this line is reached, a PTE within the kernel address space was somehow
    swapped out. Kernel memory is not swappable and so this is a critical error
Appendix H

Slab Allocator

Contents
H.1 Cache Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . 444
  H.1.1 Cache Creation . . . . . . . . . . . . . . . . . . . . . . . . . 444
    H.1.1.1 Function: kmem_cache_create() . . . . . . . . . . . . . . . 444
  H.1.2 Calculating the Number of Objects on a Slab . . . . . . . . . . 453
    H.1.2.1 Function: kmem_cache_estimate() . . . . . . . . . . . . . . 453
  H.1.3 Cache Shrinking . . . . . . . . . . . . . . . . . . . . . . . . . 454
    H.1.3.1 Function: kmem_cache_shrink() . . . . . . . . . . . . . . . 455
    H.1.3.2 Function: __kmem_cache_shrink() . . . . . . . . . . . . . . 455
    H.1.3.3 Function: __kmem_cache_shrink_locked() . . . . . . . . . . 456
  H.1.4 Cache Destroying . . . . . . . . . . . . . . . . . . . . . . . . 457
    H.1.4.1 Function: kmem_cache_destroy() . . . . . . . . . . . . . . . 458
  H.1.5 Cache Reaping . . . . . . . . . . . . . . . . . . . . . . . . . . 459
    H.1.5.1 Function: kmem_cache_reap() . . . . . . . . . . . . . . . . 459
H.2 Slabs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
  H.2.1 Storing the Slab Descriptor . . . . . . . . . . . . . . . . . . . 464
    H.2.1.1 Function: kmem_cache_slabmgmt() . . . . . . . . . . . . . . 464
    H.2.1.2 Function: kmem_find_general_cachep() . . . . . . . . . . . . 465
  H.2.2 Slab Creation . . . . . . . . . . . . . . . . . . . . . . . . . . 466
    H.2.2.1 Function: kmem_cache_grow() . . . . . . . . . . . . . . . . 466
  H.2.3 Slab Destroying . . . . . . . . . . . . . . . . . . . . . . . . . 470
    H.2.3.1 Function: kmem_slab_destroy() . . . . . . . . . . . . . . . 470
H.3 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
  H.3.1 Initialising Objects in a Slab . . . . . . . . . . . . . . . . . 472
    H.3.1.1 Function: kmem_cache_init_objs() . . . . . . . . . . . . . . 472
  H.3.2 Object Allocation . . . . . . . . . . . . . . . . . . . . . . . . 474
    H.3.2.1 Function: kmem_cache_alloc() . . . . . . . . . . . . . . . . 474
    H.3.2.2 Function: __kmem_cache_alloc (UP Case)() . . . . . . . . . . 475
    H.3.2.3 Function: __kmem_cache_alloc (SMP Case)() . . . . . . . . . 476
    H.3.2.4 Function: kmem_cache_alloc_head() . . . . . . . . . . . . . 477
    H.3.2.5 Function: kmem_cache_alloc_one() . . . . . . . . . . . . . . 478
    H.3.2.6 Function: kmem_cache_alloc_one_tail() . . . . . . . . . . . 479
    H.3.2.7 Function: kmem_cache_alloc_batch() . . . . . . . . . . . . . 480
  H.3.3 Object Freeing . . . . . . . . . . . . . . . . . . . . . . . . . 482
    H.3.3.1 Function: kmem_cache_free() . . . . . . . . . . . . . . . . 482
    H.3.3.2 Function: __kmem_cache_free (UP Case)() . . . . . . . . . . 482
    H.3.3.3 Function: __kmem_cache_free (SMP Case)() . . . . . . . . . . 483
    H.3.3.4 Function: kmem_cache_free_one() . . . . . . . . . . . . . . 484
    H.3.3.5 Function: free_block() . . . . . . . . . . . . . . . . . . . 485
    H.3.3.6 Function: __free_block() . . . . . . . . . . . . . . . . . . 486
H.4 Sizes Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
  H.4.1 Initialising the Sizes Cache . . . . . . . . . . . . . . . . . . 487
    H.4.1.1 Function: kmem_cache_sizes_init() . . . . . . . . . . . . . 487
  H.4.2 kmalloc() . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
    H.4.2.1 Function: kmalloc() . . . . . . . . . . . . . . . . . . . . . 488
  H.4.3 kfree() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
    H.4.3.1 Function: kfree() . . . . . . . . . . . . . . . . . . . . . . 489
H.5 Per-CPU Object Cache . . . . . . . . . . . . . . . . . . . . . . . . . 490
  H.5.1 Enabling Per-CPU Caches . . . . . . . . . . . . . . . . . . . . . 490
    H.5.1.1 Function: enable_all_cpucaches() . . . . . . . . . . . . . . 490
    H.5.1.2 Function: enable_cpucache() . . . . . . . . . . . . . . . . 491
    H.5.1.3 Function: kmem_tune_cpucache() . . . . . . . . . . . . . . . 492
  H.5.2 Updating Per-CPU Information . . . . . . . . . . . . . . . . . . 495
    H.5.2.1 Function: smp_call_function_all_cpus() . . . . . . . . . . . 495
    H.5.2.2 Function: do_ccupdate_local() . . . . . . . . . . . . . . . 495
  H.5.3 Draining a Per-CPU Cache . . . . . . . . . . . . . . . . . . . . 496
    H.5.3.1 Function: drain_cpu_caches() . . . . . . . . . . . . . . . . 496
H.6 Slab Allocator Initialisation . . . . . . . . . . . . . . . . . . . . 498
    H.6.0.2 Function: kmem_cache_init() . . . . . . . . . . . . . . . . 498
H.7 Interfacing with the Buddy Allocator . . . . . . . . . . . . . . . . . 499
    H.7.0.3 Function: kmem_getpages() . . . . . . . . . . . . . . . . . 499
    H.7.0.4 Function: kmem_freepages() . . . . . . . . . . . . . . . . . 499
H.1 Cache Manipulation

Contents
H.1 Cache Manipulation                                  444
  H.1.1 Cache Creation                                  444
    H.1.1.1 Function: kmem_cache_create()               444
  H.1.2 Calculating the Number of Objects on a Slab     453
    H.1.2.1 Function: kmem_cache_estimate()             453
  H.1.3 Cache Shrinking                                 454
    H.1.3.1 Function: kmem_cache_shrink()               455
    H.1.3.2 Function: __kmem_cache_shrink()             455
    H.1.3.3 Function: __kmem_cache_shrink_locked()      456
  H.1.4 Cache Destroying                                457
    H.1.4.1 Function: kmem_cache_destroy()              458
  H.1.5 Cache Reaping                                   459
    H.1.5.1 Function: kmem_cache_reap()                 459

H.1.1 Cache Creation

H.1.1.1 Function: kmem_cache_create() (mm/slab.c)

The call graph for this function is shown in Figure 8.3. This function is
responsible for the creation of a new cache and will be dealt with in chunks due to
its size. The chunks roughly are:
• Perform basic sanity checks for bad usage
• Perform debugging checks if CONFIG_SLAB_DEBUG is set
• Allocate a kmem_cache_t from the cache_cache slab cache
• Align the object size to the word size
• Calculate how many objects will fit on a slab
• Align the slab size to the hardware cache
• Calculate colour offsets
• Initialise remaining fields in cache descriptor
• Add the new cache to the cache chain

621 kmem_cache_t *
622 kmem_cache_create (const char *name, size_t size,
623         size_t offset, unsigned long flags,
            void (*ctor)(void*, kmem_cache_t *, unsigned long),
624         void (*dtor)(void*, kmem_cache_t *, unsigned long))
625 {
626         const char *func_nm = KERN_ERR "kmem_create: ";
627         size_t left_over, align, slab_size;
628         kmem_cache_t *cachep = NULL;
629
633         if ((!name) ||
634                 ((strlen(name) >= CACHE_NAMELEN - 1)) ||
635                 in_interrupt() ||
636                 (size < BYTES_PER_WORD) ||
637                 (size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
638                 (dtor && !ctor) ||
639                 (offset < 0 || offset > size))
640                         BUG();
641
Perform basic sanity checks for bad usage
622 The parameters of the function are:

    name The human readable name of the cache

    size The size of an object

    offset This is used to specify a specific alignment for objects in the cache but
        it is usually left as 0

    flags Static cache flags

    ctor A constructor function to call for each object during slab creation

    dtor The corresponding destructor function. It is expected the destructor
        function leaves an object in an initialised state

633-640 These are all serious usage bugs that prevent the cache even attempting to
    create

634 If the human readable name is greater than the maximum size for a cache name
    (CACHE_NAMELEN)

635 An interrupt handler cannot create a cache as access to interrupt-safe spinlocks
    and semaphores are needed

636 The object size must be at least a word in size. The slab allocator is not
    suitable for objects whose size is measured in individual bytes

637 The largest possible slab that can be created is 2^MAX_OBJ_ORDER number of pages
    which provides 32 pages

638 A destructor cannot be used if no constructor is available

639 The offset cannot be before the slab or beyond the boundary of the first page

640 Call BUG() to exit
642 #if DEBUG
643     if ((flags & SLAB_DEBUG_INITIAL) && !ctor) {
645         printk("%sNo con, but init state check
                    requested - %s\n", func_nm, name);
646         flags &= ~SLAB_DEBUG_INITIAL;
647     }
648
649     if ((flags & SLAB_POISON) && ctor) {
651         printk("%sPoisoning requested, but con given - %s\n",
                    func_nm, name);
652         flags &= ~SLAB_POISON;
653     }
654 #if FORCED_DEBUG
655     if ((size < (PAGE_SIZE>>3)) &&
            !(flags & SLAB_MUST_HWCACHE_ALIGN))
660         flags |= SLAB_RED_ZONE;
661     if (!ctor)
662         flags |= SLAB_POISON;
663 #endif
664 #endif
670     BUG_ON(flags & ~CREATE_MASK);
This block performs debugging checks if CONFIG_SLAB_DEBUG is set.

643-646 The flag SLAB_DEBUG_INITIAL requests that the constructor check the objects to make sure they are in an initialised state. For this, a constructor must exist. If it does not, the flag is cleared

649-653 A slab can be poisoned with a known pattern to make sure an object wasn't used before it was allocated, but a constructor would ruin this pattern, falsely reporting a bug. If a constructor exists, remove the SLAB_POISON flag if set

655-660 Only small objects will be red zoned for debugging. Red zoning large objects would cause severe fragmentation

661-662 If there is no constructor, set the poison bit
670 The CREATE_MASK is set with all the allowable flags kmem_cache_create() (See Section H.1.1.1) can be called with. This prevents callers using debugging flags when they are not available and BUG()s instead

673     cachep =
            (kmem_cache_t *) kmem_cache_alloc(&cache_cache,
                                              SLAB_KERNEL);
674     if (!cachep)
675         goto opps;
676     memset(cachep, 0, sizeof(kmem_cache_t));

Allocate a kmem_cache_t from the cache_cache slab cache.

673 Allocate a cache descriptor object from the cache_cache with kmem_cache_alloc() (See Section H.3.2.1)

674-675 If out of memory, goto opps which handles the OOM situation

676 Zero fill the object to prevent surprises with uninitialised data
682     if (size & (BYTES_PER_WORD-1)) {
683         size += (BYTES_PER_WORD-1);
684         size &= ~(BYTES_PER_WORD-1);
685         printk("%sForcing size word alignment
                    - %s\n", func_nm, name);
686     }
687
688 #if DEBUG
689     if (flags & SLAB_RED_ZONE) {
694         flags &= ~SLAB_HWCACHE_ALIGN;
695         size += 2*BYTES_PER_WORD;
696     }
697 #endif
698     align = BYTES_PER_WORD;
699     if (flags & SLAB_HWCACHE_ALIGN)
700         align = L1_CACHE_BYTES;
701
703     if (size >= (PAGE_SIZE>>3))
708         flags |= CFLGS_OFF_SLAB;
709
710     if (flags & SLAB_HWCACHE_ALIGN) {
714         while (size < align/2)
715             align /= 2;
716         size = (size+align-1)&(~(align-1));
717     }
Align the object size to some word-sized boundary.

682 If the size is not aligned to the size of a word then...

683-684 Increase the object by the size of a word then mask out the lower bits. This will effectively round the object size up to the next word boundary

685 Print out an informational message for debugging purposes

688-697 If debugging is enabled then the alignments have to change slightly

694 Do not bother trying to align things to the hardware cache if the slab will be red zoned. The red zoning of the object is going to offset it by moving the object one word away from the cache boundary

695 The size of the object increases by two BYTES_PER_WORD to store the red zone mark at either end of the object

698 Initialise the alignment to be to a word boundary. This will change if the caller has requested a CPU cache alignment

699-700 If requested, align the objects to the L1 CPU cache

703 If the objects are large, store the slab descriptors off-slab. This will allow better packing of objects into the slab

710 If hardware cache alignment is requested, the size of the objects must be adjusted to align themselves to the hardware cache

714-715 Try and pack objects into one cache line if they fit while still keeping the alignment. This is important to architectures (e.g. Alpha or Pentium 4) with large L1 cache lines. align will be adjusted to be the smallest value that still gives hardware cache alignment. For machines with large L1 cache lines, two or more small objects may fit into each line. For example, two objects from the size-32 cache will fit on one cache line of a Pentium 4

716 Round the cache size up to the hardware cache alignment
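The rounding in blocks 682-684 and 710-716 is plain bit arithmetic, and is easy to try outside the kernel. The following is a minimal standalone sketch, not kernel code; the word size and the 32-byte L1 line size are illustrative assumptions.

#include <stdio.h>

#define BYTES_PER_WORD  sizeof(long)   /* word size of the machine      */
#define L1_CACHE_BYTES  32             /* assumed L1 cache line size    */

/* Round size up to the next multiple of align (align is a power of two) */
static size_t round_up(size_t size, size_t align)
{
        return (size + align - 1) & ~(align - 1);
}

int main(void)
{
        size_t size  = 6;               /* requested object size */
        size_t align = L1_CACHE_BYTES;

        size = round_up(size, BYTES_PER_WORD);  /* lines 682-684 */

        /* lines 714-715: shrink align while two objects would still fit */
        while (size < align / 2)
                align /= 2;
        size = round_up(size, align);           /* line 716 */

        printf("object size %zu, alignment %zu\n", size, align);
        return 0;
}

With these assumed numbers, a 6-byte object is padded to 16 bytes with a 16-byte alignment, so two such objects share one 32-byte cache line, matching the size-32 example in the text.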
724     do {
725         unsigned int break_flag = 0;
726 cal_wastage:
727         kmem_cache_estimate(cachep->gfporder, size, flags,
728                             &left_over, &cachep->num);
729         if (break_flag)
730             break;
731         if (cachep->gfporder >= MAX_GFP_ORDER)
732             break;
733         if (!cachep->num)
734             goto next;
735         if (flags & CFLGS_OFF_SLAB &&
                cachep->num > offslab_limit) {
737             cachep->gfporder--;
738             break_flag++;
739             goto cal_wastage;
740         }
741
746         if (cachep->gfporder >= slab_break_gfp_order)
747             break;
748
749         if ((left_over*8) <= (PAGE_SIZE<<cachep->gfporder))
750             break;
751 next:
752         cachep->gfporder++;
753     } while (1);
754
755     if (!cachep->num) {
756         printk("kmem_cache_create: couldn't
                    create cache %s.\n", name);
757         kmem_cache_free(&cache_cache, cachep);
758         cachep = NULL;
759         goto opps;
760     }
Calculate how many objects will fit on a slab and adjust the slab size as necessary.

727-728 kmem_cache_estimate() (see Section H.1.2.1) calculates the number of objects that can fit on a slab at the current gfp order and what the amount of leftover bytes will be

729-730 The break_flag is set if the number of objects fitting on the slab exceeds the number that can be kept when off-slab slab descriptors are used

731-732 The order number of pages used must not exceed MAX_GFP_ORDER (5)

733-734 If not even one object fit on the slab, goto next: which will increase the gfporder used for the cache

735 If the slab descriptor is kept off-cache but the number of objects exceeds the number that can be tracked with bufctl's off-slab then ...

737 Reduce the order number of pages used

738 Set the break_flag so the loop will exit

739 Calculate the new wastage figures

746-747 The slab_break_gfp_order is the order not to exceed unless 0 objects fit on the slab. This check ensures the order is not exceeded

749-750 This is a rough check for internal fragmentation. If the wastage as a fraction of the total size of the cache is less than one eighth, it is acceptable

752 If the fragmentation is too high, increase the gfp order and recalculate the number of objects that can be stored and the wastage

755 If after adjustments, objects still do not fit in the cache, it cannot be created

757-758 Free the cache descriptor and set the pointer to NULL

759 Goto opps which simply returns the NULL pointer
761     slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t) +
762                                sizeof(slab_t));
767     if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
768         flags &= ~CFLGS_OFF_SLAB;
769         left_over -= slab_size;
770     }
Align the slab size to the hardware cache.

761 slab_size is the total size of the slab descriptor, not the size of the slab itself. It is the size of the slab_t struct plus the number of objects * size of the bufctl

767-769 If there is enough leftover space for the slab descriptor and it was specified to place the descriptor off-slab, remove the flag and update the amount of left_over bytes there is. This will impact the cache colouring but, with the large objects associated with off-slab descriptors, this is not a problem
773     offset += (align-1);
774     offset &= ~(align-1);
775     if (!offset)
776         offset = L1_CACHE_BYTES;
777     cachep->colour_off = offset;
778     cachep->colour = left_over/offset;
Calculate colour offsets.

773-774 offset is the offset within the page the caller requested. This will make sure the offset requested is at the correct alignment for cache usage

775-776 If somehow the offset is 0, then set it to be aligned for the CPU cache

777 This is the offset to use to keep objects on different cache lines. Each slab created will be given a different colour offset

778 This is the number of different offsets that can be used
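As a concrete illustration of lines 777-778, and of how kmem_cache_grow() later cycles through the colours, the following standalone sketch works through the arithmetic. The figures (100 bytes left over, a 32-byte colour offset) are assumptions chosen for the example, not values from a real cache.

#include <stdio.h>

int main(void)
{
        unsigned int left_over   = 100;  /* assumed leftover bytes per slab */
        unsigned int colour_off  = 32;   /* cachep->colour_off              */
        unsigned int colour      = left_over / colour_off;  /* = 3 colours  */
        unsigned int colour_next = 0;    /* cycles as slabs are created     */
        int slab;

        for (slab = 0; slab < 5; slab++) {
                unsigned int offset = colour_next * colour_off;
                printf("slab %d starts its objects %u bytes into the slab\n",
                       slab, offset);
                if (++colour_next >= colour)  /* wrap, as kmem_cache_grow() does */
                        colour_next = 0;
        }
        return 0;
}

Successive slabs would therefore start their objects at offsets 0, 32, 64, 0, 32 bytes, spreading the first objects of different slabs across different cache lines.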
781     if (!cachep->gfporder && !(flags & CFLGS_OFF_SLAB))
782         flags |= CFLGS_OPTIMIZE;
783
784     cachep->flags = flags;
785     cachep->gfpflags = 0;
786     if (flags & SLAB_CACHE_DMA)
787         cachep->gfpflags |= GFP_DMA;
788     spin_lock_init(&cachep->spinlock);
789     cachep->objsize = size;
790     INIT_LIST_HEAD(&cachep->slabs_full);
791     INIT_LIST_HEAD(&cachep->slabs_partial);
792     INIT_LIST_HEAD(&cachep->slabs_free);
793
794     if (flags & CFLGS_OFF_SLAB)
795         cachep->slabp_cache =
                kmem_find_general_cachep(slab_size,0);
796     cachep->ctor = ctor;
797     cachep->dtor = dtor;
799     strcpy(cachep->name, name);
800
801 #ifdef CONFIG_SMP
802     if (g_cpucache_up)
803         enable_cpucache(cachep);
804 #endif
Initialise remaining fields in the cache descriptor.

781-782 For caches with slabs of only 1 page, the CFLGS_OPTIMIZE flag is set. In reality it makes no difference as the flag is unused

784 Set the cache static flags

785 Zero out the gfpflags. This is a defunct operation as the memset() after the cache descriptor was allocated would do this

786-787 If the slab is for DMA use, set the GFP_DMA flag so the buddy allocator will use ZONE_DMA

788 Initialise the spinlock for accessing the cache

789 Copy in the object size, which now takes hardware cache alignment into account if necessary

790-792 Initialise the slab lists

794-795 If the descriptor is kept off-slab, allocate a slab manager and place it for use in slabp_cache. See Section H.2.1.2

796-797 Set the pointers to the constructor and destructor functions

799 Copy in the human readable name

802-803 If per-cpu caches are enabled, create a set for this cache. See Section 8.5
806     down(&cache_chain_sem);
807     {
808         struct list_head *p;
809
810         list_for_each(p, &cache_chain) {
811             kmem_cache_t *pc = list_entry(p,
                                     kmem_cache_t, next);
812
814             if (!strcmp(pc->name, name))
815                 BUG();
816         }
817     }
818
822     list_add(&cachep->next, &cache_chain);
823     up(&cache_chain_sem);
824 opps:
825     return cachep;
826 }
Add the new cache to the cache chain.

806 Acquire the semaphore used to synchronise access to the cache chain

810-816 Check every cache on the cache chain and make sure there is no other cache with the same name. If there is, it means two caches of the same type are being created, which is a serious bug

811 Get the cache from the list

814-815 Compare the names and if they match, BUG(). It is worth noting that the new cache is not deleted, but this error is the result of sloppy programming during development and not a normal scenario

822 Link the cache into the chain

823 Release the cache chain semaphore

825 Return the new cache pointer
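To put the function in context, the following is a minimal sketch of how a 2.4-era kernel component might use this interface. It is not a standalone program: it assumes a 2.4 kernel build environment and the slab API described in this appendix, and the cache name, object structure and flags are illustrative assumptions only.

#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical object type managed by the example cache */
struct my_object {
        int  id;
        char data[64];
};

static kmem_cache_t *my_cachep;

static int my_cache_setup(void)
{
        /* Create a hardware cache aligned cache of my_object with
         * no constructor or destructor (both may be NULL) */
        my_cachep = kmem_cache_create("my_object_cache",
                                      sizeof(struct my_object),
                                      0, SLAB_HWCACHE_ALIGN,
                                      NULL, NULL);
        if (!my_cachep)
                return -ENOMEM;
        return 0;
}

static void my_cache_use(void)
{
        /* SLAB_KERNEL means the caller may sleep during allocation */
        struct my_object *obj = kmem_cache_alloc(my_cachep, SLAB_KERNEL);

        if (obj) {
                obj->id = 1;
                kmem_cache_free(my_cachep, obj);  /* return it to the cache */
        }
}

static void my_cache_teardown(void)
{
        /* All objects must have been freed before the cache is destroyed */
        kmem_cache_destroy(my_cachep);
}

The setup, use and teardown steps correspond to the creation path described here, the allocation and freeing paths in Section H.3 and the destruction path in Section H.1.4.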
H.1.2 Calculating the Number of Objects on a Slab

H.1.2.1 Function: kmem_cache_estimate() (mm/slab.c)

During cache creation, it is determined how many objects can be stored in a slab and how much wastage there will be. The following function calculates how many objects may be stored, taking into account whether the slab and bufctl's must be stored on-slab.
388 static void kmem_cache_estimate (unsigned long gfporder, size_t size,
389     int flags, size_t *left_over, unsigned int *num)
390 {
391     int i;
392     size_t wastage = PAGE_SIZE<<gfporder;
393     size_t extra = 0;
394     size_t base = 0;
395
396     if (!(flags & CFLGS_OFF_SLAB)) {
397         base = sizeof(slab_t);
398         extra = sizeof(kmem_bufctl_t);
399     }
400     i = 0;
401     while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
402         i++;
403     if (i > 0)
404         i--;
405
406     if (i > SLAB_LIMIT)
407         i = SLAB_LIMIT;
408
409     *num = i;
410     wastage -= i*size;
411     wastage -= L1_CACHE_ALIGN(base+i*extra);
412     *left_over = wastage;
413 }
388 The parameters of the function are as follows:

gfporder 2^gfporder is the number of pages to allocate for each slab

size The size of each object

flags The cache flags

left_over The number of bytes left over in the slab. Returned to caller

num The number of objects that will fit in a slab. Returned to caller

392 wastage is decremented throughout the function. It starts with the maximum possible amount of wastage

393 extra is the number of bytes needed to store kmem_bufctl_t

394 base is where usable memory in the slab starts

396 If the slab descriptor is kept on cache, the base begins at the end of the slab_t struct and the number of bytes needed to store the bufctl is the size of kmem_bufctl_t

400 i becomes the number of objects the slab can hold

401-402 This counts up the number of objects that the cache can store. i*size is the size of the object itself. L1_CACHE_ALIGN(base+i*extra) is slightly trickier. This is calculating the amount of memory needed to store the kmem_bufctl_t needed for every object in the slab. As it is at the beginning of the slab, it is L1 cache aligned so that the first object in the slab will be aligned to the hardware cache. i*extra will calculate the amount of space needed to hold a kmem_bufctl_t for this object. As wastage starts out as the size of the slab, its use is overloaded here

403-404 Because the previous loop counts until the slab overflows, the number of objects that can be stored is i-1

406-407 SLAB_LIMIT is the absolute largest number of objects a slab can store. It is defined as 0xFFFE as this is the largest number kmem_bufctl_t, which is an unsigned integer, can hold

409 num is now the number of objects a slab can hold

410 Take away the space taken up by all the objects from wastage

411 Take away the space taken up by the kmem_bufctl_t

412 wastage has now been calculated as the leftover space in the slab
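The counting loop in lines 400-404 is easy to model outside the kernel. The following standalone sketch reproduces the calculation for an on-slab descriptor; the page size, cache line size and the stand-in sizes for slab_t and kmem_bufctl_t are assumptions for illustration, not the kernel's real values.

#include <stdio.h>

#define PAGE_SIZE        4096
#define L1_CACHE_BYTES   32
#define L1_CACHE_ALIGN(x) (((x) + L1_CACHE_BYTES - 1) & ~(L1_CACHE_BYTES - 1))

#define SLAB_T_SIZE      32   /* assumed sizeof(slab_t)        */
#define BUFCTL_SIZE      4    /* assumed sizeof(kmem_bufctl_t) */

/* Model of kmem_cache_estimate() for an on-slab descriptor */
static void estimate(unsigned long gfporder, size_t size,
                     size_t *left_over, unsigned int *num)
{
        size_t wastage = PAGE_SIZE << gfporder;
        size_t base  = SLAB_T_SIZE;   /* descriptor lives on the slab */
        size_t extra = BUFCTL_SIZE;   /* one bufctl per object        */
        unsigned int i = 0;

        /* Count objects until the slab would overflow, then step back one */
        while (i * size + L1_CACHE_ALIGN(base + i * extra) <= wastage)
                i++;
        if (i > 0)
                i--;

        *num = i;
        *left_over = wastage - i * size - L1_CACHE_ALIGN(base + i * extra);
}

int main(void)
{
        size_t left_over;
        unsigned int num;

        estimate(0, 256, &left_over, &num);   /* one page, 256-byte objects */
        printf("%u objects per slab, %zu bytes left over\n", num, left_over);
        return 0;
}

With these assumed sizes, a single-page slab of 256-byte objects holds 15 objects with 160 bytes left over for the descriptor padding and cache colouring.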
H.1.3 Cache Shrinking

The call graph for kmem_cache_shrink() is shown in Figure 8.5. Two varieties of shrink functions are provided. kmem_cache_shrink() removes all slabs from slabs_free and returns the number of pages freed as a result. __kmem_cache_shrink() frees all slabs from slabs_free and then verifies that slabs_partial and slabs_full are empty. This is important during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.

H.1.3.1 Function: kmem_cache_shrink() (mm/slab.c)

This function performs basic debugging checks and then acquires the cache descriptor lock before freeing slabs. At one time, it also used to call drain_cpu_caches() to free up objects on the per-cpu cache. It is curious that this was removed as it is possible slabs could not be freed because an object was allocated on a per-cpu cache but not in use.
966 int kmem_cache_shrink(kmem_cache_t *cachep)
967 {
968     int ret;
969
970     if (!cachep || in_interrupt() ||
            !is_chained_kmem_cache(cachep))
971         BUG();
972
973     spin_lock_irq(&cachep->spinlock);
974     ret = __kmem_cache_shrink_locked(cachep);
975     spin_unlock_irq(&cachep->spinlock);
976
977     return ret << cachep->gfporder;
978 }
966 The parameter is the cache being shrunk

970 Check that:

• The cache pointer is not NULL
• An interrupt is not the caller
• The cache is on the cache chain and not a bad pointer

973 Acquire the cache descriptor lock and disable interrupts

974 Shrink the cache

975 Release the cache lock and enable interrupts

977 This returns the number of pages freed but does not take into account the objects freed by draining the CPU
H.1.3.2 Function: __kmem_cache_shrink() (mm/slab.c)

This function is identical to kmem_cache_shrink() except that it returns whether the cache is empty or not. This is important during cache destruction when it is not important how much memory was freed, just that it is safe to delete the cache and not leak memory.
945 static int __kmem_cache_shrink(kmem_cache_t *cachep)
946 {
947     int ret;
948
949     drain_cpu_caches(cachep);
950
951     spin_lock_irq(&cachep->spinlock);
952     __kmem_cache_shrink_locked(cachep);
953     ret = !list_empty(&cachep->slabs_full) ||
954           !list_empty(&cachep->slabs_partial);
955     spin_unlock_irq(&cachep->spinlock);
956     return ret;
957 }
949 Remove all objects from the per-CPU objects cache
951 Acquire the cache descriptor lock and disable interrupts
952 Free all slabs in the slabs_free list
953-954 Check that the slabs_partial and slabs_full lists are empty

955 Release the cache descriptor lock and re-enable interrupts

956 Return whether the cache has all its slabs free or not
H.1.3.3 Function: __kmem_cache_shrink_locked() (mm/slab.c)

This does the dirty work of freeing slabs. It will keep destroying them until the growing flag gets set, indicating the cache is in use, or until there are no more slabs in slabs_free.
917 static int __kmem_cache_shrink_locked(kmem_cache_t *cachep)
918 {
919     slab_t *slabp;
920     int ret = 0;
921
923     while (!cachep->growing) {
924         struct list_head *p;
925
926         p = cachep->slabs_free.prev;
927         if (p == &cachep->slabs_free)
928             break;
929
930         slabp = list_entry(cachep->slabs_free.prev,
                               slab_t, list);
931 #if DEBUG
932         if (slabp->inuse)
933             BUG();
934 #endif
935         list_del(&slabp->list);
936
937         spin_unlock_irq(&cachep->spinlock);
938         kmem_slab_destroy(cachep, slabp);
939         ret++;
940         spin_lock_irq(&cachep->spinlock);
941     }
942     return ret;
943 }
923 While the cache is not growing, free slabs

926-930 Get the last slab on the slabs_free list

932-933 If debugging is available, make sure it is not in use. If it was in use, it should not be on the slabs_free list in the first place

935 Remove the slab from the list

937 Re-enable interrupts. This function is called with interrupts disabled and this is to re-enable them again as quickly as possible

938 Delete the slab with kmem_slab_destroy() (See Section H.2.3.1)

939 Record the number of slabs freed

940 Acquire the cache descriptor lock and disable interrupts
H.1.4 Cache Destroying

When a module is unloaded, it is responsible for destroying any cache it has created, as during module loading it is ensured that there are not two caches of the same name. Core kernel code often does not destroy its caches as their existence persists for the life of the system. The steps taken to destroy a cache are:

• Delete the cache from the cache chain
• Shrink the cache to delete all slabs (see Section 8.1.8)
• Free any per CPU caches (kfree())
• Delete the cache descriptor from the cache_cache (see Section 8.3.3)
H.1.4.1 Function: kmem_cache_destroy() (mm/slab.c)

The call graph for this function is shown in Figure 8.7.
 997 int kmem_cache_destroy (kmem_cache_t * cachep)
 998 {
 999     if (!cachep || in_interrupt() || cachep->growing)
1000         BUG();
1001
1002     /* Find the cache in the chain of caches. */
1003     down(&cache_chain_sem);
1004     /* the chain is never empty, cache_cache is never destroyed */
1005     if (clock_searchp == cachep)
1006         clock_searchp = list_entry(cachep->next.next,
1007                                    kmem_cache_t, next);
1008     list_del(&cachep->next);
1009     up(&cache_chain_sem);
1010
1011     if (__kmem_cache_shrink(cachep)) {
1012         printk(KERN_ERR
                    "kmem_cache_destroy: Can't free all objects %p\n",
1013                cachep);
1014         down(&cache_chain_sem);
1015         list_add(&cachep->next,&cache_chain);
1016         up(&cache_chain_sem);
1017         return 1;
1018     }
1019 #ifdef CONFIG_SMP
1020     {
1021         int i;
1022         for (i = 0; i < NR_CPUS; i++)
1023             kfree(cachep->cpudata[i]);
1024     }
1025 #endif
1026     kmem_cache_free(&cache_cache, cachep);
1027
1028     return 0;
1029 }
999-1000 Sanity check. Make sure the cachep is not null, that an interrupt is not trying to do this and that the cache has not been marked as growing, indicating it is in use

1003 Acquire the semaphore for accessing the cache chain

1005-1007 Acquire the list entry from the cache chain
1008 Delete this cache from the cache chain

1009 Release the cache chain semaphore

1011 Shrink the cache to free all slabs with __kmem_cache_shrink() (See Section H.1.3.2)

1012-1017 The shrink function returns true if there are still slabs in the cache. If there are, the cache cannot be destroyed, so it is added back into the cache chain and the error reported

1022-1023 If SMP is enabled, the per-cpu data structures are deleted with kfree() (See Section H.4.3.1)

1026 Delete the cache descriptor from the cache_cache with kmem_cache_free() (See Section H.3.3.1)
H.1.5 Cache Reaping

H.1.5.1 Function: kmem_cache_reap() (mm/slab.c)

The call graph for this function is shown in Figure 8.4. Because of the size of this function, it will be broken up into three separate sections. The first is simple function preamble. The second is the selection of a cache to reap and the third is the freeing of the slabs. The basic tasks were described in Section 8.1.7.
1738 int kmem_cache_reap (int gfp_mask)
1739 {
1740     slab_t *slabp;
1741     kmem_cache_t *searchp;
1742     kmem_cache_t *best_cachep;
1743     unsigned int best_pages;
1744     unsigned int best_len;
1745     unsigned int scan;
1746     int ret = 0;
1747
1748     if (gfp_mask & __GFP_WAIT)
1749         down(&cache_chain_sem);
1750     else
1751         if (down_trylock(&cache_chain_sem))
1752             return 0;
1753
1754     scan = REAP_SCANLEN;
1755     best_len = 0;
1756     best_pages = 0;
1757     best_cachep = NULL;
1758     searchp = clock_searchp;
1738 The only parameter is the GFP flag. The only check made is against the __GFP_WAIT flag. As the only caller, kswapd, can sleep, this parameter is virtually worthless

1748-1749 Can the caller sleep? If yes, then acquire the semaphore

1751-1752 Else, try and acquire the semaphore and if it is not available, return

1754 REAP_SCANLEN (10) is the number of caches to examine

1758 Set searchp to be the last cache that was examined at the last reap
1759     do {
1760         unsigned int pages;
1761         struct list_head* p;
1762         unsigned int full_free;
1763
1765         if (searchp->flags & SLAB_NO_REAP)
1766             goto next;
1767         spin_lock_irq(&searchp->spinlock);
1768         if (searchp->growing)
1769             goto next_unlock;
1770         if (searchp->dflags & DFLGS_GROWN) {
1771             searchp->dflags &= ~DFLGS_GROWN;
1772             goto next_unlock;
1773         }
1774 #ifdef CONFIG_SMP
1775         {
1776             cpucache_t *cc = cc_data(searchp);
1777             if (cc && cc->avail) {
1778                 __free_block(searchp, cc_entry(cc),
                                  cc->avail);
1779                 cc->avail = 0;
1780             }
1781         }
1782 #endif
1783
1784         full_free = 0;
1785         p = searchp->slabs_free.next;
1786         while (p != &searchp->slabs_free) {
1787             slabp = list_entry(p, slab_t, list);
1788 #if DEBUG
1789             if (slabp->inuse)
1790                 BUG();
1791 #endif
1792             full_free++;
1793             p = p->next;
1794         }
1795
1801         pages = full_free * (1<<searchp->gfporder);
1802         if (searchp->ctor)
1803             pages = (pages*4+1)/5;
1804         if (searchp->gfporder)
1805             pages = (pages*4+1)/5;
1806         if (pages > best_pages) {
1807             best_cachep = searchp;
1808             best_len = full_free;
1809             best_pages = pages;
1810             if (pages >= REAP_PERFECT) {
1811                 clock_searchp =
                         list_entry(searchp->next.next,
                                    kmem_cache_t,next);
1812                 goto perfect;
1813             }
1814         }
1815
1816 next_unlock:
1817         spin_unlock_irq(&searchp->spinlock);
1818 next:
1819         searchp =
                 list_entry(searchp->next.next,kmem_cache_t,next);
1820     } while (--scan && searchp != clock_searchp);
This block examines REAP_SCANLEN number of caches to select one to free.

1767 Acquire an interrupt safe lock to the cache descriptor

1768-1769 If the cache is growing, skip it

1770-1773 If the cache has grown recently, skip it and clear the flag

1775-1781 Free any per CPU objects to the global pool

1786-1794 Count the number of slabs in the slabs_free list

1801 Calculate the number of pages all the slabs hold

1802-1803 If the objects have constructors, reduce the page count by one fifth to make it less likely to be selected for reaping

1804-1805 If the slabs consist of more than one page, reduce the page count by one fifth. This is because high order pages are hard to acquire. As an example of both adjustments, 10 free pages in a cache that has a constructor and uses order-1 slabs count as (10*4+1)/5 = 8, then (8*4+1)/5 = 6 pages

1806 If this is the best candidate found for reaping so far, check if it is perfect for reaping
1807-1809 Record the new maximums

1808 best_len is recorded so that it is easy to know how many slabs make up half of the slabs in the free list

1810 If this cache is perfect for reaping then

1811 Update clock_searchp

1812 Goto perfect where half the slabs will be freed

1816 This label is reached if it was found that the cache was growing after acquiring the lock

1817 Release the cache descriptor lock

1818 Move to the next entry in the cache chain

1820 Scan while REAP_SCANLEN has not been reached and we have not cycled around the whole cache chain
1822     clock_searchp = searchp;
1823
1824     if (!best_cachep)
1826         goto out;
1827
1828     spin_lock_irq(&best_cachep->spinlock);
1829 perfect:
1830     /* free only 50% of the free slabs */
1831     best_len = (best_len + 1)/2;
1832     for (scan = 0; scan < best_len; scan++) {
1833         struct list_head *p;
1834
1835         if (best_cachep->growing)
1836             break;
1837         p = best_cachep->slabs_free.prev;
1838         if (p == &best_cachep->slabs_free)
1839             break;
1840         slabp = list_entry(p,slab_t,list);
1841 #if DEBUG
1842         if (slabp->inuse)
1843             BUG();
1844 #endif
1845         list_del(&slabp->list);
1846         STATS_INC_REAPED(best_cachep);
1847
1848         /* Safe to drop the lock. The slab is no longer
1849          * linked to the cache.
1850          */
1851         spin_unlock_irq(&best_cachep->spinlock);
1852         kmem_slab_destroy(best_cachep, slabp);
1853         spin_lock_irq(&best_cachep->spinlock);
1854     }
1855     spin_unlock_irq(&best_cachep->spinlock);
1856     ret = scan * (1 << best_cachep->gfporder);
1857 out:
1858     up(&cache_chain_sem);
1859     return ret;
1860 }
This block will free half of the slabs from the selected cache.

1822 Update clock_searchp for the next cache reap

1824-1826 If a cache was not found, goto out to release the cache chain semaphore and exit

1828 Acquire the cache descriptor spinlock and disable interrupts. The cachep descriptor has to be held by an interrupt safe lock as some caches may be used from interrupt context. The slab allocator has no way to differentiate between interrupt safe and unsafe caches

1831 Adjust best_len to be the number of slabs to free

1832-1854 Free best_len number of slabs

1835-1836 If the cache is growing, exit

1837 Get a slab from the list

1838-1839 If there are no slabs left in the list, exit

1840 Get the slab pointer

1842-1843 If debugging is enabled, make sure there are no active objects in the slab

1845 Remove the slab from the slabs_free list

1846 Update statistics if enabled

1851 Release the cache descriptor lock and enable interrupts

1852 Destroy the slab. See Section 8.2.8

1853 Re-acquire the cache descriptor spinlock and disable interrupts

1855 Release the cache descriptor lock and enable interrupts

1856 ret is the number of pages that were freed

1858-1859 Release the cache chain semaphore and return the number of pages freed
H.2 Slabs

Contents
H.2 Slabs 464
H.2.1 Storing the Slab Descriptor 464
H.2.1.1 Function: kmem_cache_slabmgmt() 464
H.2.1.2 Function: kmem_find_general_cachep() 465
H.2.2 Slab Creation 466
H.2.2.1 Function: kmem_cache_grow() 466
H.2.3 Slab Destroying 470
H.2.3.1 Function: kmem_slab_destroy() 470
H.2.1 Storing the Slab Descriptor

H.2.1.1 Function: kmem_cache_slabmgmt() (mm/slab.c)

This function will either allocate space to keep the slab descriptor off cache or reserve enough space at the beginning of the slab for the descriptor and the bufctls.
1032 static inline slab_t * kmem_cache_slabmgmt (
                                kmem_cache_t *cachep,
1033                            void *objp,
                                int colour_off,
                                int local_flags)
1034 {
1035     slab_t *slabp;
1036
1037     if (OFF_SLAB(cachep)) {
1039         slabp = kmem_cache_alloc(cachep->slabp_cache,
                                      local_flags);
1040         if (!slabp)
1041             return NULL;
1042     } else {
1047         slabp = objp+colour_off;
1048         colour_off += L1_CACHE_ALIGN(cachep->num *
1049                           sizeof(kmem_bufctl_t) +
                               sizeof(slab_t));
1050     }
1051     slabp->inuse = 0;
1052     slabp->colouroff = colour_off;
1053     slabp->s_mem = objp+colour_off;
1054
1055     return slabp;
1056 }
1032 The parameters of the function are:

cachep The cache the slab is to be allocated to

objp When the function is called, this points to the beginning of the slab

colour_off The colour offset for this slab

local_flags These are the flags for the cache

1037-1042 If the slab descriptor is kept off cache....

1039 Allocate memory from the sizes cache. During cache creation, slabp_cache is set to the appropriate size cache to allocate from

1040 If the allocation failed, return

1042-1050 Reserve space at the beginning of the slab

1047 The address of the slab will be the beginning of the slab (objp) plus the colour offset

1048 colour_off is calculated to be the offset where the first object will be placed. The address is L1 cache aligned. cachep->num * sizeof(kmem_bufctl_t) is the amount of space needed to hold the bufctls for each object in the slab and sizeof(slab_t) is the size of the slab descriptor. This effectively has reserved the space at the beginning of the slab

1051 The number of objects in use on the slab is 0

1052 The colouroff is updated for placement of the new object

1053 The address of the first object is calculated as the address of the beginning of the slab plus the offset
H.2.1.2 Function: kmem_find_general_cachep() (mm/slab.c)

If the slab descriptor is to be kept off-slab, this function, called during cache creation, will find the appropriate sizes cache to use. The cache will be stored within the cache descriptor in the field slabp_cache.
1620 kmem_cache_t * kmem_find_general_cachep (size_t size,
                                              int gfpflags)
1621 {
1622     cache_sizes_t *csizep = cache_sizes;
1623
1628     for ( ; csizep->cs_size; csizep++) {
1629         if (size > csizep->cs_size)
1630             continue;
1631         break;
1632     }
1633     return (gfpflags & GFP_DMA) ? csizep->cs_dmacachep :
                csizep->cs_cachep;
1634 }
1620 size is the size of the slab descriptor. gfpflags is always 0 as DMA memory is not needed for a slab descriptor

1628-1632 Starting with the smallest size, keep increasing the size until a cache is found with buffers large enough to store the slab descriptor

1633 Return either a normal or DMA sized cache depending on the gfpflags passed in. In reality, only the cs_cachep is ever passed back
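The lookup simply walks an array of geometrically increasing size classes and picks the first one big enough. The following standalone model shows the same selection logic; the size classes, the names and the 0-sized terminator are assumptions in the spirit of cache_sizes, not the kernel's actual table.

#include <stdio.h>
#include <stddef.h>

/* Assumed stand-in for cache_sizes_t: a size class and its name */
struct size_class {
        size_t      cs_size;
        const char *name;
};

/* A 0-sized entry terminates the table, as in the kernel */
static const struct size_class cache_sizes[] = {
        {   32, "size-32"   },
        {   64, "size-64"   },
        {  128, "size-128"  },
        {  256, "size-256"  },
        {  512, "size-512"  },
        { 1024, "size-1024" },
        {    0, NULL        }
};

/* Model of kmem_find_general_cachep(): first class that fits */
static const struct size_class *find_general_cachep(size_t size)
{
        const struct size_class *csizep = cache_sizes;

        for ( ; csizep->cs_size; csizep++) {
                if (size > csizep->cs_size)
                        continue;
                break;
        }
        return csizep->cs_size ? csizep : NULL;
}

int main(void)
{
        const struct size_class *c = find_general_cachep(100);

        if (c)
                printf("a 100-byte request would be served by %s\n", c->name);
        return 0;
}

A 100-byte request falls through size-32 and size-64 and is served by the size-128 class, which is also how kmalloc() (Section H.4.2) rounds requests up to a size cache.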
H.2.2 Slab Creation

H.2.2.1 Function: kmem_cache_grow() (mm/slab.c)

The call graph for this function is shown in Figure 8.11. The basic tasks for this function are:

• Perform basic sanity checks to guard against bad usage
• Calculate the colour offset for objects in this slab
• Allocate memory for the slab and acquire a slab descriptor
• Link the pages used for the slab to the slab and cache descriptors
• Initialise objects in the slab
• Add the slab to the cache
1105 static int kmem_cache_grow (kmem_cache_t * cachep, int flags)
1106 {
1107     slab_t       *slabp;
1108     struct page  *page;
1109     void         *objp;
1110     size_t        offset;
1111     unsigned int  i, local_flags;
1112     unsigned long ctor_flags;
1113     unsigned long save_flags;
Basic declarations. The parameters of the function are:

cachep The cache to allocate a new slab to

flags The flags for slab creation
1118     if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
1119         BUG();
1120     if (flags & SLAB_NO_GROW)
1121         return 0;
1122
1129     if (in_interrupt() &&
             (flags & SLAB_LEVEL_MASK) != SLAB_ATOMIC)
1130         BUG();
1131
1132     ctor_flags = SLAB_CTOR_CONSTRUCTOR;
1133     local_flags = (flags & SLAB_LEVEL_MASK);
1134     if (local_flags == SLAB_ATOMIC)
1139         ctor_flags |= SLAB_CTOR_ATOMIC;
Perform basic sanity checks to guard against bad usage. The checks are made here rather than in kmem_cache_alloc() to protect the speed-critical path. There is no point checking the flags every time an object needs to be allocated.

1118-1119 Make sure only allowable flags are used for allocation

1120-1121 Do not grow the cache if this is set. In reality, it is never set

1129-1130 If this is called within interrupt context, make sure the ATOMIC flag is set so we don't sleep when kmem_getpages() (See Section H.7.0.3) is called

1132 This flag tells the constructor it is to initialise the object

1133 The local_flags are just those relevant to the page allocator

1134-1139 If the SLAB_ATOMIC flag is set, the constructor needs to know about it in case it wants to make new allocations
1142     spin_lock_irqsave(&cachep->spinlock, save_flags);
1143
1145     offset = cachep->colour_next;
1146     cachep->colour_next++;
1147     if (cachep->colour_next >= cachep->colour)
1148         cachep->colour_next = 0;
1149     offset *= cachep->colour_off;
1150     cachep->dflags |= DFLGS_GROWN;
1151
1152     cachep->growing++;
1153     spin_unlock_irqrestore(&cachep->spinlock, save_flags);
Calculate the colour offset for objects in this slab.

1142 Acquire an interrupt safe lock for accessing the cache descriptor

1145 Get the offset for objects in this slab

1146 Move to the next colour offset

1147-1148 If colour has been reached, there are no more offsets available, so reset colour_next to 0

1149 colour_off is the size of each offset, so offset * colour_off will give how many bytes to offset the objects by

1150 Mark the cache as growing so that kmem_cache_reap() (See Section H.1.5.1) will ignore this cache

1152 Increase the count for callers growing this cache

1153 Release the spinlock and re-enable interrupts
1165     if (!(objp = kmem_getpages(cachep, flags)))
1166         goto failed;
1167
1169     if (!(slabp = kmem_cache_slabmgmt(cachep,
                                           objp, offset,
                                           local_flags)))
1170         goto opps1;
Allocate memory for the slab and acquire a slab descriptor.
1165-1166 Allocate pages from the page allocator for the slab with kmem_getpages()
(See Section H.7.0.3)
1169 Acquire a slab descriptor with kmem_cache_slabmgmt() (See Section H.2.1.1)
1173     i = 1 << cachep->gfporder;
1174     page = virt_to_page(objp);
1175     do {
1176         SET_PAGE_CACHE(page, cachep);
1177         SET_PAGE_SLAB(page, slabp);
1178         PageSetSlab(page);
1179         page++;
1180     } while (--i);
Link the pages used for the slab to the slab and cache descriptors.

1173 i is the number of pages used for the slab. Each page has to be linked to the slab and cache descriptors

1174 objp is a pointer to the beginning of the slab. The macro virt_to_page() will give the struct page for that address

1175-1180 Link each page's list field to the slab and cache descriptors

1176 SET_PAGE_CACHE() links the page to the cache descriptor using the page→list.next field

1177 SET_PAGE_SLAB() links the page to the slab descriptor using the page→list.prev field

1178 Set the PG_slab page flag. The full set of PG_ flags is listed in Table 2.1

1179 Move to the next page for this slab to be linked
1182     kmem_cache_init_objs(cachep, slabp, ctor_flags);

1182 Initialise all objects (See Section H.3.1.1)
1184     spin_lock_irqsave(&cachep->spinlock, save_flags);
1185     cachep->growing--;
1186
1188     list_add_tail(&slabp->list, &cachep->slabs_free);
1189     STATS_INC_GROWN(cachep);
1190     cachep->failures = 0;
1191
1192     spin_unlock_irqrestore(&cachep->spinlock, save_flags);
1193     return 1;
Add the slab to the cache.

1184 Acquire the cache descriptor spinlock in an interrupt safe fashion

1185 Decrease the growing count

1188 Add the slab to the end of the slabs_free list

1189 If STATS is set, increase the cachep→grown field with STATS_INC_GROWN()

1190 Set failures to 0. This field is never used elsewhere

1192 Unlock the spinlock in an interrupt safe fashion

1193 Return success
1194 opps1:
1195     kmem_freepages(cachep, objp);
1196 failed:
1197     spin_lock_irqsave(&cachep->spinlock, save_flags);
1198     cachep->growing--;
1199     spin_unlock_irqrestore(&cachep->spinlock, save_flags);
1200     return 0;
1201 }
Error handling.

1194-1195 opps1 is reached if the pages for the slab were allocated. They must be freed

1197 Acquire the spinlock for accessing the cache descriptor

1198 Reduce the growing count

1199 Release the spinlock

1200 Return failure
H.2.3 Slab Destroying

H.2.3.1 Function: kmem_slab_destroy() (mm/slab.c)

The call graph for this function is shown in Figure 8.13. For readability, the debugging sections have been omitted from this function but they are almost identical to the debugging section during object allocation. See Section H.3.1.1 for how the markers and poison pattern are checked.
555 static void kmem_slab_destroy (kmem_cache_t *cachep, slab_t *slabp)
556 {
557     if (cachep->dtor
561        ) {
562         int i;
563         for (i = 0; i < cachep->num; i++) {
564             void* objp = slabp->s_mem+cachep->objsize*i;

565-574         DEBUG: Check red zone markers

575             if (cachep->dtor)
576                 (cachep->dtor)(objp, cachep, 0);

577-584         DEBUG: Check poison pattern

585         }
586     }
587
588     kmem_freepages(cachep, slabp->s_mem-slabp->colouroff);
589     if (OFF_SLAB(cachep))
590         kmem_cache_free(cachep->slabp_cache, slabp);
591 }
557-586 If a destructor is available, call it for each object in the slab

563-585 Cycle through each object in the slab

564 Calculate the address of the object to destroy

575-576 Call the destructor

588 Free the pages being used for the slab

589 If the slab descriptor is being kept off-slab, then free the memory being used for it
H.3 Objects

Contents
H.3 Objects 472
H.3.1 Initialising Objects in a Slab 472
H.3.1.1 Function: kmem_cache_init_objs() 472
H.3.2 Object Allocation 474
H.3.2.1 Function: kmem_cache_alloc() 474
H.3.2.2 Function: __kmem_cache_alloc (UP Case)() 475
H.3.2.3 Function: __kmem_cache_alloc (SMP Case)() 476
H.3.2.4 Function: kmem_cache_alloc_head() 477
H.3.2.5 Function: kmem_cache_alloc_one() 478
H.3.2.6 Function: kmem_cache_alloc_one_tail() 479
H.3.2.7 Function: kmem_cache_alloc_batch() 480
H.3.3 Object Freeing 482
H.3.3.1 Function: kmem_cache_free() 482
H.3.3.2 Function: __kmem_cache_free (UP Case)() 482
H.3.3.3 Function: __kmem_cache_free (SMP Case)() 483
H.3.3.4 Function: kmem_cache_free_one() 484
H.3.3.5 Function: free_block() 485
H.3.3.6 Function: __free_block() 486

This section will cover how objects are managed. At this point, most of the real hard work has been completed by either the cache or slab managers.
H.3.1 Initialising Objects in a Slab

H.3.1.1 Function: kmem_cache_init_objs() (mm/slab.c)

The vast part of this function is involved with debugging, so we will start with the function without the debugging and explain that in detail before handling the debugging part. The two sections that are debugging are marked in the code excerpt below as Part 1 and Part 2.
1058 static inline void kmem_cache_init_objs (kmem_cache_t * cachep,
1059         slab_t * slabp, unsigned long ctor_flags)
1060 {
1061     int i;
1062
1063     for (i = 0; i < cachep->num; i++) {
1064         void* objp = slabp->s_mem+cachep->objsize*i;

1065-1072    /* Debugging Part 1 */

1079         if (cachep->ctor)
1080             cachep->ctor(objp, cachep, ctor_flags);

1081-1094    /* Debugging Part 2 */

1095         slab_bufctl(slabp)[i] = i+1;
1096     }
1097     slab_bufctl(slabp)[i-1] = BUFCTL_END;
1098     slabp->free = 0;
1099 }
1058 The parameters of the function are:

cachep The cache the objects are being initialised for

slabp The slab the objects are in

ctor_flags Flags the constructor needs, such as whether this is an atomic allocation or not

1063 Initialise cache→num number of objects

1064 The base address for objects in the slab is s_mem. The address of the object to allocate is then i * (size of a single object)

1079-1080 If a constructor is available, call it

1095 The macro slab_bufctl() casts slabp to a slab_t slab descriptor and adds one to it. This brings the pointer to the end of the slab descriptor and then casts it back to a kmem_bufctl_t, effectively giving the beginning of the bufctl array

1098 The index of the first free object is 0 in the bufctl array

That covers the core of initialising objects. Next the first debugging part will be covered.
1065 #if DEBUG
1066     if (cachep->flags & SLAB_RED_ZONE) {
1067         *((unsigned long*)(objp)) = RED_MAGIC1;
1068         *((unsigned long*)(objp + cachep->objsize -
1069                            BYTES_PER_WORD)) = RED_MAGIC1;
1070         objp += BYTES_PER_WORD;
1071     }
1072 #endif
1066 If the cache is to be red zoned then place a marker at either end of the object

1067 Place the marker at the beginning of the object

1068 Place the marker at the end of the object. Remember that the size of the object takes into account the size of the red markers when red zoning is enabled

1070 Increase the objp pointer by the size of the marker for the benefit of the constructor, which is called after this debugging block
1081 #if DEBUG
1082     if (cachep->flags & SLAB_RED_ZONE)
1083         objp -= BYTES_PER_WORD;
1084     if (cachep->flags & SLAB_POISON)
1086         kmem_poison_obj(cachep, objp);
1087     if (cachep->flags & SLAB_RED_ZONE) {
1088         if (*((unsigned long*)(objp)) != RED_MAGIC1)
1089             BUG();
1090         if (*((unsigned long*)(objp + cachep->objsize -
1091                                BYTES_PER_WORD)) != RED_MAGIC1)
1092             BUG();
1093     }
1094 #endif
This is the debugging block that takes place after the constructor, if it exists, has been called.

1082-1083 The objp pointer was increased by the size of the red marker in the previous debugging block so move it back again

1084-1086 If there was no constructor, poison the object with a known pattern that can be examined later to trap uninitialised writes

1088 Check to make sure the red marker at the beginning of the object was preserved to trap writes before the object

1090-1091 Check to make sure writes didn't take place past the end of the object
H.3.2 Object Allocation

H.3.2.1 Function: kmem_cache_alloc() (mm/slab.c)

The call graph for this function is shown in Figure 8.14. This trivial function simply calls __kmem_cache_alloc().

1529 void * kmem_cache_alloc (kmem_cache_t *cachep, int flags)
1531 {
1532     return __kmem_cache_alloc(cachep, flags);
1533 }
H.3.2.2 Function: __kmem_cache_alloc (UP Case)() (mm/slab.c)

This shows the parts of the function specific to the UP case. The SMP case will be dealt with in the next section.
1338 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep,
                                              int flags)
1339 {
1340     unsigned long save_flags;
1341     void* objp;
1342
1343     kmem_cache_alloc_head(cachep, flags);
1344 try_again:
1345     local_irq_save(save_flags);
1367     objp = kmem_cache_alloc_one(cachep);
1369     local_irq_restore(save_flags);
1370     return objp;
1371 alloc_new_slab:
1376     local_irq_restore(save_flags);
1377     if (kmem_cache_grow(cachep, flags))
1381         goto try_again;
1382     return NULL;
1383 }
1338 The parameters are the cache to allocate from and allocation specific flags

1343 This function makes sure the appropriate combination of DMA flags is in use

1345 Disable interrupts and save the flags. This function is used by interrupts so this is the only way to provide synchronisation in the UP case

1367 kmem_cache_alloc_one() (see Section H.3.2.5) allocates an object from one of the lists and returns it. If no objects are free, this macro (note it isn't a function) will goto alloc_new_slab at the end of this function

1369-1370 Restore interrupts and return

1376 At this label, no objects were free in slabs_partial and slabs_free is empty so a new slab is needed

1377 Allocate a new slab (see Section 8.2.2)

1381 A new slab is available so try again

1382 No slabs could be allocated so return failure
H.3.2.3 Function: __kmem_cache_alloc (SMP Case)() (mm/slab.c)

This is what the function looks like in the SMP case.
1338 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep,
                                              int flags)
1339 {
1340     unsigned long save_flags;
1341     void* objp;
1342
1343     kmem_cache_alloc_head(cachep, flags);
1344 try_again:
1345     local_irq_save(save_flags);
1347     {
1348         cpucache_t *cc = cc_data(cachep);
1349
1350         if (cc) {
1351             if (cc->avail) {
1352                 STATS_INC_ALLOCHIT(cachep);
1353                 objp = cc_entry(cc)[--cc->avail];
1354             } else {
1355                 STATS_INC_ALLOCMISS(cachep);
1356                 objp =
                         kmem_cache_alloc_batch(cachep,cc,flags);
1357                 if (!objp)
1358                     goto alloc_new_slab_nolock;
1359             }
1360         } else {
1361             spin_lock(&cachep->spinlock);
1362             objp = kmem_cache_alloc_one(cachep);
1363             spin_unlock(&cachep->spinlock);
1364         }
1365     }
1366     local_irq_restore(save_flags);
1370     return objp;
1371 alloc_new_slab:
1373     spin_unlock(&cachep->spinlock);
1374 alloc_new_slab_nolock:
1375     local_irq_restore(save_flags);
1377     if (kmem_cache_grow(cachep, flags))
1381         goto try_again;
1382     return NULL;
1383 }
1338-1347 Same as the UP case

1349 Obtain the per CPU data for this cpu

1350-1360 If a per CPU cache is available then ....

1351 If there is an object available then ....

1352 Update statistics for this cache if enabled

1353 Get an object and update the avail figure

1354 Else an object is not available so ....

1355 Update statistics for this cache if enabled

1356 Allocate batchcount number of objects, place all but one of them in the per CPU cache and return the last one to objp

1357-1358 The allocation failed, so goto alloc_new_slab_nolock to grow the cache and allocate a new slab

1360-1364 If a per CPU cache is not available, take out the cache spinlock and allocate one object in the same way the UP case does. This is the case during the initialisation of the cache_cache, for example

1363 Object was successfully assigned, release the cache spinlock

1366-1370 Re-enable interrupts and return the allocated object

1371-1372 If kmem_cache_alloc_one() failed to allocate an object, it will goto here with the spinlock still held so it must be released

1375-1383 Same as the UP case
H.3.2.4 Function: kmem_cache_alloc_head() (mm/slab.c)

This simple function ensures the right combination of slab and GFP flags is used for allocation from a slab. If a cache is for DMA use, this function will make sure the caller does not accidentally request normal memory, and vice versa.
1231 static inline void kmem_cache_alloc_head(kmem_cache_t *cachep,
                                              int flags)
1232 {
1233     if (flags & SLAB_DMA) {
1234         if (!(cachep->gfpflags & GFP_DMA))
1235             BUG();
1236     } else {
1237         if (cachep->gfpflags & GFP_DMA)
1238             BUG();
1239     }
1240 }
1231 The parameters are the cache we are allocating from and the flags requested for the allocation

1233 If the caller has requested memory for DMA use and ....

1234 The cache is not using DMA memory then BUG()

1237 Else if the caller has not requested DMA memory and this cache is for DMA use, BUG()

H.3.2.5 Function: kmem_cache_alloc_one() (mm/slab.c)

This is a preprocessor macro. It may seem strange not to make this an inline function but it is a preprocessor macro for a goto optimisation in __kmem_cache_alloc() (see Section H.3.2.2).
1283 #define kmem_cache_alloc_one(cachep)                       \
1284 ({                                                         \
1285     struct list_head * slabs_partial, * entry;             \
1286     slab_t *slabp;                                         \
1287                                                            \
1288     slabs_partial = &(cachep)->slabs_partial;              \
1289     entry = slabs_partial->next;                           \
1290     if (unlikely(entry == slabs_partial)) {                \
1291         struct list_head * slabs_free;                     \
1292         slabs_free = &(cachep)->slabs_free;                \
1293         entry = slabs_free->next;                          \
1294         if (unlikely(entry == slabs_free))                 \
1295             goto alloc_new_slab;                           \
1296         list_del(entry);                                   \
1297         list_add(entry, slabs_partial);                    \
1298     }                                                      \
1299                                                            \
1300     slabp = list_entry(entry, slab_t, list);               \
1301     kmem_cache_alloc_one_tail(cachep, slabp);              \
1302 })
1288-1289 Get the first slab from the slabs_partial list

1290-1298 If a slab is not available from this list, execute this block

1291-1293 Get the first slab from the slabs_free list

1294-1295 If there are no slabs on slabs_free, then goto alloc_new_slab. This goto label is in __kmem_cache_alloc() and it will grow the cache by one slab

1296-1297 Else remove the slab from the free list and place it on the slabs_partial list because an object is about to be removed from it

1300 Obtain the slab from the list

1301 Allocate one object from the slab
H.3.2.6 Function: kmem_cache_alloc_one_tail() (mm/slab.c)

This function is responsible for the allocation of one object from a slab. Much of it is debugging code.
1242 static inline void * kmem_cache_alloc_one_tail (
                                kmem_cache_t *cachep,
1243                            slab_t *slabp)
1244 {
1245     void *objp;
1246
1247     STATS_INC_ALLOCED(cachep);
1248     STATS_INC_ACTIVE(cachep);
1249     STATS_SET_HIGH(cachep);
1250
1252     slabp->inuse++;
1253     objp = slabp->s_mem + slabp->free*cachep->objsize;
1254     slabp->free=slab_bufctl(slabp)[slabp->free];
1255
1256     if (unlikely(slabp->free == BUFCTL_END)) {
1257         list_del(&slabp->list);
1258         list_add(&slabp->list, &cachep->slabs_full);
1259     }
1260 #if DEBUG
1261     if (cachep->flags & SLAB_POISON)
1262         if (kmem_check_poison_obj(cachep, objp))
1263             BUG();
1264     if (cachep->flags & SLAB_RED_ZONE) {
1266         if (xchg((unsigned long *)objp, RED_MAGIC2) !=
1267                 RED_MAGIC1)
1268             BUG();
1269         if (xchg((unsigned long *)(objp+cachep->objsize -
1270                 BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1)
1271             BUG();
1272         objp += BYTES_PER_WORD;
1273     }
1274 #endif
1275     return objp;
1276 }
1242-1243 The parameters are the cache and slab being allocated from

1247-1249 If stats are enabled, this will set three statistics. ALLOCED is the total number of objects that have been allocated. ACTIVE is the number of active objects in the cache. HIGH is the maximum number of objects that were active at a single time

1252 inuse is the number of objects active on this slab

1253 Get a pointer to a free object. s_mem is a pointer to the first object on the slab. free is an index of a free object in the slab, so free * object size gives an offset within the slab

1254 This updates the free pointer to be the index of the next free object

1256-1259 If the slab is full, remove it from the slabs_partial list and place it on the slabs_full list

1260-1274 Debugging code

1275 Without debugging, the object is returned to the caller

1261-1263 If the object was poisoned with a known pattern, check it to guard against uninitialised access

1266-1267 If red zoning was enabled, check the marker at the beginning of the object and confirm it is safe. Change the red marker to check for writes before the object later

1269-1271 Check the marker at the end of the object and change it to check for writes after the object later

1272 Update the object pointer to point to after the red marker

1275 Return the object
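The bufctl array behaves as a simple index-based free list: kmem_cache_init_objs() chains each slot to the next, the allocation above pops the head of that chain and kmem_cache_free_one() (Section H.3.3.4) pushes an index back on. The following standalone model captures those mechanics; the object count, the BUFCTL_END value and the absence of any slab locking are simplifying assumptions for illustration.

#include <stdio.h>

#define NUM_OBJS    4
#define BUFCTL_END  0xFFFF   /* assumed end-of-list marker */

/* Minimal stand-in for the parts of slab_t used here */
struct slab {
        unsigned int free;              /* index of first free object */
        unsigned int bufctl[NUM_OBJS];  /* next-free index per object */
        unsigned int inuse;
};

/* Mirror of the chaining done in kmem_cache_init_objs() */
static void init_slab(struct slab *s)
{
        unsigned int i;

        for (i = 0; i < NUM_OBJS; i++)
                s->bufctl[i] = i + 1;
        s->bufctl[NUM_OBJS - 1] = BUFCTL_END;
        s->free = 0;
        s->inuse = 0;
}

/* Mirror of the pop performed by kmem_cache_alloc_one_tail() */
static int alloc_obj(struct slab *s)
{
        unsigned int objnr;

        if (s->free == BUFCTL_END)
                return -1;              /* slab is full */
        objnr = s->free;
        s->free = s->bufctl[objnr];     /* advance to the next free index */
        s->inuse++;
        return (int)objnr;
}

/* Mirror of the push performed by kmem_cache_free_one() */
static void free_obj(struct slab *s, unsigned int objnr)
{
        s->bufctl[objnr] = s->free;     /* old head becomes this object's next */
        s->free = objnr;                /* freed object is the new head */
        s->inuse--;
}

int main(void)
{
        struct slab s;
        int a, b, c;

        init_slab(&s);
        a = alloc_obj(&s);
        b = alloc_obj(&s);
        c = alloc_obj(&s);
        printf("allocated indices %d %d %d\n", a, b, c);

        free_obj(&s, (unsigned int)b);  /* give the middle object back */
        printf("next alloc reuses index %d\n", alloc_obj(&s));
        return 0;
}

The objects come out in index order 0, 1, 2, and after index 1 is freed it is the next index handed out, which is exactly the LIFO behaviour of the bufctl chain.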
H.3.2.7 Function: kmem_cache_alloc_batch() (mm/slab.c)

This function allocates a batch of objects to a CPU cache of objects. It is only used in the SMP case. In many ways it is very similar to kmem_cache_alloc_one() (See Section H.3.2.5).
1305 void* kmem_cache_alloc_batch(kmem_cache_t* cachep,
                                  cpucache_t* cc, int flags)
1306 {
1307     int batchcount = cachep->batchcount;
1308
1309     spin_lock(&cachep->spinlock);
1310     while (batchcount--) {
1311         struct list_head * slabs_partial, * entry;
1312         slab_t *slabp;
1313         /* Get slab alloc is to come from. */
1314         slabs_partial = &(cachep)->slabs_partial;
1315         entry = slabs_partial->next;
1316         if (unlikely(entry == slabs_partial)) {
1317             struct list_head * slabs_free;
1318             slabs_free = &(cachep)->slabs_free;
1319             entry = slabs_free->next;
1320             if (unlikely(entry == slabs_free))
1321                 break;
1322             list_del(entry);
1323             list_add(entry, slabs_partial);
1324         }
1325         slabp = list_entry(entry, slab_t, list);
1326         cc_entry(cc)[cc->avail++] =
1327             kmem_cache_alloc_one_tail(cachep, slabp);
1328     }
1329
1330     spin_unlock(&cachep->spinlock);
1331
1332     if (cc->avail)
1333         return cc_entry(cc)[--cc->avail];
1334     return NULL;
1335 }
1305 The parameters are the cache to allocate from, the per CPU cache to fill and the allocation flags

1307 batchcount is the number of objects to allocate

1309 Obtain the spinlock for access to the cache descriptor

1310-1329 Loop batchcount times

1311-1324 This is essentially the same as kmem_cache_alloc_one() (See Section H.3.2.5). It selects a slab from either slabs_partial or slabs_free to allocate from. If none are available, break out of the loop

1326-1327 Call kmem_cache_alloc_one_tail() (See Section H.3.2.6) and place it in the per CPU cache

1330 Release the cache descriptor lock

1332-1333 Take one of the objects allocated in this batch and return it

1334 If no object was allocated, return. __kmem_cache_alloc() (See Section H.3.2.2) will grow the cache by one slab and try again
H.3.3 Object Freeing

H.3.3.1 Function: kmem_cache_free() (mm/slab.c)

The call graph for this function is shown in Figure 8.15.
1576 void kmem_cache_free (kmem_cache_t *cachep, void *objp)
1577 {
1578     unsigned long flags;
1579 #if DEBUG
1580     CHECK_PAGE(virt_to_page(objp));
1581     if (cachep != GET_PAGE_CACHE(virt_to_page(objp)))
1582         BUG();
1583 #endif
1584
1585     local_irq_save(flags);
1586     __kmem_cache_free(cachep, objp);
1587     local_irq_restore(flags);
1588 }
1576 The parameters are the cache the object is being freed from and the object itself

1579-1583 If debugging is enabled, the page will first be checked with CHECK_PAGE() to make sure it is a slab page. Secondly the page list will be examined to make sure it belongs to this cache (See Figure 8.8)

1585 Interrupts are disabled to protect the path

1586 __kmem_cache_free() (See Section H.3.3.2) will free the object to the per-CPU cache for the SMP case and to the global pool in the normal case

1587 Re-enable interrupts
H.3.3.2 Function: __kmem_cache_free (UP Case)() (mm/slab.c)

This covers what the function looks like in the UP case. Clearly, it simply releases the object to the slab.

1493 static inline void __kmem_cache_free (kmem_cache_t *cachep,
                                           void* objp)
1494 {
1517     kmem_cache_free_one(cachep, objp);
1519 }
H.3.3.3 Function: __kmem_cache_free (SMP Case)() (mm/slab.c)

This case is slightly more interesting. In this case, the object is released to the per-cpu cache if it is available.
1493 static inline void __kmem_cache_free (kmem_cache_t *cachep,
                                           void* objp)
1494 {
1496     cpucache_t *cc = cc_data(cachep);
1497
1498     CHECK_PAGE(virt_to_page(objp));
1499     if (cc) {
1500         int batchcount;
1501         if (cc->avail < cc->limit) {
1502             STATS_INC_FREEHIT(cachep);
1503             cc_entry(cc)[cc->avail++] = objp;
1504             return;
1505         }
1506         STATS_INC_FREEMISS(cachep);
1507         batchcount = cachep->batchcount;
1508         cc->avail -= batchcount;
1509         free_block(cachep,
1510                    &cc_entry(cc)[cc->avail],batchcount);
1511         cc_entry(cc)[cc->avail++] = objp;
1512         return;
1513     } else {
1514         free_block(cachep, &objp, 1);
1515     }
1519 }
1496 Get the data for this per CPU cache (See Section 8.5.1)

1498 Make sure the page is a slab page

1499-1513 If a per-CPU cache is available, try to use it. This is not always available. During cache destruction, for instance, the per CPU caches are already gone

1501-1505 If the number of objects available in the per CPU cache is below limit, then add the object to the free list and return

1506 Update statistics if enabled

1507 The pool has overflowed so batchcount number of objects is going to be freed to the global pool

1508 Update the number of available (avail) objects

1509-1510 Free a block of objects to the global cache

1511 Free the requested object and place it on the per CPU pool

1513 If the per-CPU cache is not available, then free this object to the global pool
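The per-CPU free path amounts to pushing onto a small array until it reaches limit, then flushing batchcount entries back to the global lists before storing the new object. The standalone model below captures that overflow behaviour; the limit, the batchcount and the use of plain integers instead of real objects are assumptions chosen purely for illustration.

#include <stdio.h>

#define LIMIT       4   /* assumed cc->limit          */
#define BATCHCOUNT  2   /* assumed cachep->batchcount */

static int cc_entry[LIMIT];   /* per-CPU array of "objects" */
static int cc_avail;          /* cc->avail                  */

/* Stand-in for free_block(): would return objects to their slabs */
static void free_block(int *objs, int len)
{
        int i;
        for (i = 0; i < len; i++)
                printf("  free_block: object %d goes back to its slab\n",
                       objs[i]);
}

/* Mirror of the SMP __kmem_cache_free() logic */
static void cache_free(int obj)
{
        if (cc_avail < LIMIT) {          /* fast path: pool has room */
                cc_entry[cc_avail++] = obj;
                return;
        }
        cc_avail -= BATCHCOUNT;          /* pool overflowed: flush a batch */
        free_block(&cc_entry[cc_avail], BATCHCOUNT);
        cc_entry[cc_avail++] = obj;      /* then keep the freed object */
}

int main(void)
{
        int obj;

        for (obj = 1; obj <= 6; obj++) {
                printf("freeing object %d (avail=%d)\n", obj, cc_avail);
                cache_free(obj);
        }
        return 0;
}

The first four frees are absorbed by the per-CPU pool; the fifth overflows it, so two objects are handed to free_block() (and from there to kmem_cache_free_one()) while the newly freed object stays on the fast per-CPU path.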
H.3.3.4 Function: kmem_cache_free_one() (mm/slab.c)
1414 static inline void kmem_cache_free_one(kmem_cache_t *cachep,
                                            void *objp)
1415 {
1416     slab_t* slabp;
1417
1418     CHECK_PAGE(virt_to_page(objp));
1425     slabp = GET_PAGE_SLAB(virt_to_page(objp));
1426
1427 #if DEBUG
1428     if (cachep->flags & SLAB_DEBUG_INITIAL)
1433         cachep->ctor(objp, cachep,
                     SLAB_CTOR_CONSTRUCTOR|SLAB_CTOR_VERIFY);
1434
1435     if (cachep->flags & SLAB_RED_ZONE) {
1436         objp -= BYTES_PER_WORD;
1437         if (xchg((unsigned long *)objp, RED_MAGIC1) !=
                     RED_MAGIC2)
1438             BUG();
1440         if (xchg((unsigned long *)(objp+cachep->objsize -
1441                 BYTES_PER_WORD), RED_MAGIC1) !=
                     RED_MAGIC2)
1443             BUG();
1444     }
1445     if (cachep->flags & SLAB_POISON)
1446         kmem_poison_obj(cachep, objp);
1447     if (kmem_extra_free_checks(cachep, slabp, objp))
1448         return;
1449 #endif
1450     {
1451         unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
1452
1453         slab_bufctl(slabp)[objnr] = slabp->free;
1454         slabp->free = objnr;
1455     }
1456     STATS_DEC_ACTIVE(cachep);
1457
1459     {
1460         int inuse = slabp->inuse;
1461         if (unlikely(!--slabp->inuse)) {
1462             /* Was partial or full, now empty. */
1463             list_del(&slabp->list);
1464             list_add(&slabp->list, &cachep->slabs_free);
1465         } else if (unlikely(inuse == cachep->num)) {
1466             /* Was full. */
1467             list_del(&slabp->list);
1468             list_add(&slabp->list, &cachep->slabs_partial);
1469         }
1470     }
1471 }
1418 Make sure the page is a slab page
1425 Get the slab descriptor for the page
1427-1449 Debugging material. Discussed at the end of the section
1451 Calculate the index of the object being freed
1454 As this object is now free, update the bufctl to reflect that
1456 If statistics are enabled, decrement the number of active objects in the slab
1461-1464 If inuse reaches 0, the slab is free and is moved to the slabs_free list
1465-1468 If the number in use was equal to the number of objects in the slab, the slab was full, so move it to the slabs_partial list
1471 End of function
1428-1433 If SLAB_DEBUG_INITIAL is set, the constructor is called to verify the object is in an initialised state
1435-1444 Verify the red marks at either end of the object are still there. This will check for writes beyond the boundaries of the object and for double frees
1445-1446 Poison the freed object with a known pattern
1447-1448 This function will confirm the object is a part of this slab and cache. It will then check the free list (bufctl) to make sure this is not a double free
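The bufctl manipulation on lines 1451-1454 is just a push onto an index-based free list stored beside the slab's objects. The sketch below is a plain userspace illustration (the names slab_sim, slab_free_obj and the sizes are invented) showing how the object's index is derived from its address and pushed onto that list.

#include <stdio.h>
#include <stddef.h>

#define NUM_OBJS   4
#define OBJSIZE    32
#define BUFCTL_END 0xffffffffu

struct slab_sim {
    unsigned int free;                 /* index of first free object, like slabp->free */
    unsigned int bufctl[NUM_OBJS];     /* per-object "next free" indices */
    char s_mem[NUM_OBJS * OBJSIZE];    /* object storage, like slabp->s_mem */
};

/* Mirrors lines 1451-1454: compute the index and push it on the free list */
static void slab_free_obj(struct slab_sim *s, void *objp)
{
    unsigned int objnr = (unsigned int)(((char *)objp - s->s_mem) / OBJSIZE);
    s->bufctl[objnr] = s->free;
    s->free = objnr;
}

int main(void)
{
    struct slab_sim s = { .free = BUFCTL_END };

    /* pretend objects 1 and 3 were allocated and are now being freed */
    slab_free_obj(&s, &s.s_mem[1 * OBJSIZE]);
    slab_free_obj(&s, &s.s_mem[3 * OBJSIZE]);

    /* walk the resulting free list: 3 -> 1 -> end */
    for (unsigned int i = s.free; i != BUFCTL_END; i = s.bufctl[i])
        printf("free object index %u\n", i);
    return 0;
}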
H.3.3.5  Function: free_block()    (mm/slab.c)

This function is only used in the SMP case when the per-CPU cache gets too full. It is used to free a batch of objects in bulk.
1481 static void free_block (kmem_cache_t* cachep, void** objpp,
                             int len)
1482 {
1483     spin_lock(&cachep->spinlock);
1484     __free_block(cachep, objpp, len);
1485     spin_unlock(&cachep->spinlock);
1486 }
1481 The parameters are:
cachep The cache that objects are being freed from
objpp Pointer to the first object to free
len The number of objects to free
1483 Acquire a lock to the cache descriptor
1484 __free_block()(See Section H.3.3.6) performs the actual task of freeing up each of the objects
1485 Release the lock
H.3.3.6  Function: __free_block()    (mm/slab.c)

This function is responsible for freeing each of the objects in the per-CPU array objpp.
1474 static inline void __free_block (kmem_cache_t* cachep,
1475                                  void** objpp, int len)
1476 {
1477     for ( ; len > 0; len--, objpp++)
1478         kmem_cache_free_one(cachep, *objpp);
1479 }
1474 The parameters are the cache the objects belong to (cachep), the list of objects (objpp) and the number of objects to free (len)
1477 Loop len number of times
1478 Free an object from the array
H.4 Sizes Cache
Contents
H.4 Sizes Cache
  H.4.1 Initialising the Sizes Cache
    H.4.1.1 Function: kmem_cache_sizes_init()
  H.4.2 kmalloc()
    H.4.2.1 Function: kmalloc()
  H.4.3 kfree()
    H.4.3.1 Function: kfree()
H.4.1 Initialising the Sizes Cache
H.4.1.1  Function: kmem_cache_sizes_init()    (mm/slab.c)

This function is responsible for creating pairs of caches for small memory buffers suitable for either normal or DMA memory.
436 void __init kmem_cache_sizes_init(void)
437 {
438     cache_sizes_t *sizes = cache_sizes;
439     char name[20];
440
444     if (num_physpages > (32 << 20) >> PAGE_SHIFT)
445         slab_break_gfp_order = BREAK_GFP_ORDER_HI;
446     do {
452         snprintf(name, sizeof(name), "size-%Zd",
                     sizes->cs_size);
453         if (!(sizes->cs_cachep =
454             kmem_cache_create(name, sizes->cs_size,
455                     0, SLAB_HWCACHE_ALIGN, NULL, NULL))) {
456             BUG();
457         }
458
460         if (!(OFF_SLAB(sizes->cs_cachep))) {
461             offslab_limit = sizes->cs_size-sizeof(slab_t);
462             offslab_limit /= 2;
463         }
464         snprintf(name, sizeof(name), "size-%Zd(DMA)",
                     sizes->cs_size);
465         sizes->cs_dmacachep = kmem_cache_create(name,
                     sizes->cs_size, 0,
466                     SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN,
                     NULL, NULL);
467         if (!sizes->cs_dmacachep)
468             BUG();
469         sizes++;
470     } while (sizes->cs_size);
471 }
438 Get a pointer to the cache_sizes array
439 The human readable name of the cache. It should be sized CACHE_NAMELEN which is defined to be 20 bytes long
444-445 slab_break_gfp_order determines how many pages a slab may use unless 0 objects fit into the slab. It is statically initialised to BREAK_GFP_ORDER_LO (1). This check sees if more than 32MiB of memory is available and if it is, allow BREAK_GFP_ORDER_HI number of pages to be used because internal fragmentation is more acceptable when more memory is available
446-470 Create two caches for each size of memory allocation needed
452 Store the human readable cache name in name
453-454 Create the cache, aligned to the L1 cache
460-463 Calculate the off-slab bufctl limit which determines the number of objects that can be stored in a cache when the slab descriptor is kept off-slab
464 The human readable name for the cache for DMA use
465-466 Create the cache, aligned to the L1 cache and suitable for DMA use
467 If the cache failed to allocate, it is a bug. If memory is unavailable this early, the machine will not boot
469 Move to the next element in the cache_sizes array
470 The array is terminated with a 0 as the last element
H.4.2 kmalloc()

H.4.2.1  Function: kmalloc()    (mm/slab.c)

The call graph for this function is shown in Figure 8.16.
1555 void * kmalloc (size_t size, int flags)
1556 {
1557     cache_sizes_t *csizep = cache_sizes;
1558
1559     for (; csizep->cs_size; csizep++) {
1560         if (size > csizep->cs_size)
1561             continue;
1562         return __kmem_cache_alloc(flags & GFP_DMA ?
1563              csizep->cs_dmacachep : csizep->cs_cachep, flags);
1564     }
1565     return NULL;
1566 }
1557 cache_sizes is the array of caches for each size (See Section 8.4)
1559-1564 Starting with the smallest cache, examine the size of each cache until one large enough to satisfy the request is found
1562 If the allocation is for use with DMA, allocate an object from cs_dmacachep, else use cs_cachep
1565 If a sizes cache of sufficient size was not available or an object could not be allocated, return failure
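The lookup kmalloc() performs is simply a linear scan of a size-ordered table until the first cache big enough for the request is found. The following userspace sketch shows the same idea; the table contents and the helper name pick_size() are illustrative assumptions and do not reflect the exact cache_sizes values.

#include <stdio.h>
#include <stddef.h>

/* A size-ordered table, smallest first, terminated by 0 (like cache_sizes) */
static const size_t sizes[] = { 32, 64, 128, 256, 512, 1024, 2048, 4096, 0 };

/* Return the smallest table entry >= size, or 0 if the request is too big */
static size_t pick_size(size_t size)
{
    for (const size_t *csize = sizes; *csize; csize++) {
        if (size > *csize)
            continue;       /* too small, keep scanning */
        return *csize;      /* first cache large enough */
    }
    return 0;               /* no sizes cache can satisfy this request */
}

int main(void)
{
    size_t requests[] = { 1, 33, 100, 3000, 5000 };
    for (size_t i = 0; i < sizeof(requests)/sizeof(requests[0]); i++) {
        size_t s = pick_size(requests[i]);
        if (s)
            printf("request %4zu -> size-%zu cache\n", requests[i], s);
        else
            printf("request %4zu -> no sizes cache large enough\n", requests[i]);
    }
    return 0;
}

A consequence of this scheme, visible in the sketch, is internal fragmentation: a 33-byte request is served from the 64-byte cache.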
H.4.3 kfree()

H.4.3.1  Function: kfree()    (mm/slab.c)

The call graph for this function is shown in Figure 8.17. It is worth noting that the work this function does is almost identical to the function kmem_cache_free() with debugging enabled (See Section H.3.3.1).
1597 void kfree (const void *objp)
1598 {
1599     kmem_cache_t *c;
1600     unsigned long flags;
1601
1602     if (!objp)
1603         return;
1604     local_irq_save(flags);
1605     CHECK_PAGE(virt_to_page(objp));
1606     c = GET_PAGE_CACHE(virt_to_page(objp));
1607     __kmem_cache_free(c, (void*)objp);
1608     local_irq_restore(flags);
1609 }
1602 Return if the pointer is NULL. This is possible if a caller used kmalloc() and had a catch-all failure routine which called kfree() immediately
1604 Disable interrupts
1605 Make sure the page this object is in is a slab page
1606 Get the cache this pointer belongs to (See Section 8.2)
1607 Free the memory object
1608 Re-enable interrupts
H.5 Per-CPU Object Cache
Contents
H.5 Per-CPU Object Cache
  H.5.1 Enabling Per-CPU Caches
    H.5.1.1 Function: enable_all_cpucaches()
    H.5.1.2 Function: enable_cpucache()
    H.5.1.3 Function: kmem_tune_cpucache()
  H.5.2 Updating Per-CPU Information
    H.5.2.1 Function: smp_call_function_all_cpus()
    H.5.2.2 Function: do_ccupdate_local()
  H.5.3 Draining a Per-CPU Cache
    H.5.3.1 Function: drain_cpu_caches()
The structure of the Per-CPU object cache and how objects are added or removed
from them is covered in detail in Sections 8.5.1 and 8.5.2.
H.5.1 Enabling Per-CPU Caches
H.5.1.1  Function: enable_all_cpucaches()    (mm/slab.c)

Figure H.1: Call Graph: enable_all_cpucaches()

This function locks the cache chain and enables the cpucache for every cache. This is important after the cache_cache and sizes cache have been enabled.
1714 static void enable_all_cpucaches (void)
1715 {
1716     struct list_head* p;
1717
1718     down(&cache_chain_sem);
1719     p = &cache_cache.next;
1720
1721     do {
1722         kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next);
1723
1724         enable_cpucache(cachep);
1725         p = cachep->next.next;
1726     } while (p != &cache_cache.next);
1727
1728     up(&cache_chain_sem);
1729 }
1718 Obtain the semaphore to the cache chain
1719 Get the first cache on the chain
1721-1726 Cycle through the whole chain
1722 Get a cache from the chain. This code will skip the first cache on the chain but cache_cache doesn't need a cpucache as it is so rarely used
1724 Enable the cpucache
1725 Move to the next cache on the chain
1728 Release the cache chain semaphore
H.5.1.2  Function: enable_cpucache()    (mm/slab.c)

This function calculates what the size of a cpucache should be based on the size of the objects the cache contains before calling kmem_tune_cpucache() which does the actual allocation.
1693 static void enable_cpucache (kmem_cache_t *cachep)
1694 {
1695     int err;
1696     int limit;
1697
1699     if (cachep->objsize > PAGE_SIZE)
1700         return;
1701     if (cachep->objsize > 1024)
1702         limit = 60;
1703     else if (cachep->objsize > 256)
1704         limit = 124;
1705     else
1706         limit = 252;
1707
1708     err = kmem_tune_cpucache(cachep, limit, limit/2);
1709     if (err)
1710         printk(KERN_ERR
                 "enable_cpucache failed for %s, error %d.\n",
                     cachep->name, -err);
1712 }
1699-1700 If an object is larger than a page, do not create a per-CPU cache as they are too expensive
1701-1702 If an object is larger than 1KiB, keep the cpu cache below 3MiB in size. The limit is set to 60 objects to take the size of the cpucache descriptors into account
1703-1704 For smaller objects, just make sure the cache doesn't go above 3MiB in size
1708 Allocate the memory for the cpucache
1710-1711 Print out an error message if the allocation failed
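The limit selection above is just a three-way size threshold with batchcount fixed at half the limit. A small illustrative sketch (userspace C, thresholds copied from the listing, helper name cpucache_limit invented):

#include <stdio.h>

#define PAGE_SIZE 4096

/* Mirrors the thresholds in enable_cpucache(): bigger objects get smaller
 * per-CPU caches, and objects larger than a page get none at all (0). */
static int cpucache_limit(int objsize)
{
    if (objsize > PAGE_SIZE)
        return 0;
    if (objsize > 1024)
        return 60;
    if (objsize > 256)
        return 124;
    return 252;
}

int main(void)
{
    int sizes[] = { 32, 512, 2048, 8192 };
    for (int i = 0; i < 4; i++) {
        int limit = cpucache_limit(sizes[i]);
        printf("objsize %4d -> limit %3d, batchcount %3d\n",
               sizes[i], limit, limit / 2);
    }
    return 0;
}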
H.5.1.3  Function: kmem_tune_cpucache()    (mm/slab.c)

This function is responsible for allocating memory for the cpucaches. For each CPU on the system, kmalloc() gives a block of memory large enough for one cpucache and fills a ccupdate_struct_t struct. The function smp_call_function_all_cpus() then calls do_ccupdate_local() which swaps the new information with the old information in the cache descriptor.
1639 static int kmem_tune_cpucache (kmem_cache_t* cachep,
                                    int limit, int batchcount)
1640 {
1641     ccupdate_struct_t new;
1642     int i;
1643
1644     /*
1645      * These are admin-provided, so we are more graceful.
1646      */
1647     if (limit < 0)
1648         return -EINVAL;
1649     if (batchcount < 0)
1650         return -EINVAL;
1651     if (batchcount > limit)
1652         return -EINVAL;
1653     if (limit != 0 && !batchcount)
1654         return -EINVAL;
1655
1656     memset(&new.new,0,sizeof(new.new));
1657     if (limit) {
1658         for (i = 0; i< smp_num_cpus; i++) {
1659             cpucache_t* ccnew;
1660
1661             ccnew = kmalloc(sizeof(void*)*limit+
                         sizeof(cpucache_t), GFP_KERNEL);
1662             if (!ccnew)
1663                 goto oom;
1664             ccnew->limit = limit;
1665             ccnew->avail = 0;
1666             new.new[cpu_logical_map(i)] = ccnew;
1667         }
1668     }
1669
1670     new.cachep = cachep;
1671     spin_lock_irq(&cachep->spinlock);
1672     cachep->batchcount = batchcount;
1673     spin_unlock_irq(&cachep->spinlock);
1674
1675     smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
1676
1677     for (i = 0; i < smp_num_cpus; i++) {
1678         cpucache_t* ccold = new.new[cpu_logical_map(i)];
1679         if (!ccold)
1680             continue;
1681         local_irq_disable();
1682         free_block(cachep, cc_entry(ccold), ccold->avail);
1683         local_irq_enable();
1684         kfree(ccold);
1685     }
1686     return 0;
1687 oom:
1688     for (i--; i >= 0; i--)
1689         kfree(new.new[cpu_logical_map(i)]);
1690     return -ENOMEM;
1691 }
1639 The parameters of the function are:
cachep The cache this cpucache is being allocated for
limit The total number of objects that can exist in the cpucache
batchcount The number of objects to allocate in one batch when the cpucache is empty
1647 The number of objects in the cache cannot be negative
1649 A negative number of objects cannot be allocated in batch
1651 A batch of objects greater than the limit cannot be allocated
1653 A batchcount must be provided if the limit is positive
1656 Zero fill the update struct
1657 If a limit is provided, allocate memory for the cpucache
1658-1668 For every CPU, allocate a cpucache
1661 The amount of memory needed is limit number of pointers plus the size of the cpucache descriptor
1663 If out of memory, clean up and exit
1664-1665 Fill in the fields for the cpucache descriptor
1666 Fill in the information for the ccupdate_struct_t struct
1670 Tell the ccupdate_struct_t struct what cache is being updated
1671-1673 Acquire an interrupt safe lock to the cache descriptor and set its batchcount
1675 Get each CPU to update its cpucache information for itself. This swaps the old cpucaches in the cache descriptor with the new ones in new using do_ccupdate_local() (See Section H.5.2.2)
1677-1685 After smp_call_function_all_cpus() (See Section H.5.2.1), the old cpucaches are in new. This block of code cycles through them all, frees any objects in them and deletes the old cpucache
1686 Return success
1688 In the event there is no memory, delete all cpucaches that have been allocated up until this point and return failure
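The pointer dance between kmem_tune_cpucache() and do_ccupdate_local() is easier to see in miniature: a per-CPU array of replacement pointers is prepared, each CPU swaps its live pointer with the prepared one, and the caller is left holding the old pointers to free. The sketch below models this in plain C for a fixed number of "CPUs"; the names ccupdate_sim, cpucache_sim and swap_cpu are invented.

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS 2

struct cpucache_sim { int limit; int avail; };

/* live per-CPU data, analogous to cc_data(cachep) for each CPU */
static struct cpucache_sim *live[NR_CPUS];

/* analogous to ccupdate_struct_t: the prepared replacements */
struct ccupdate_sim { struct cpucache_sim *new[NR_CPUS]; };

/* What do_ccupdate_local() does on one CPU: swap live and prepared pointers */
static void swap_cpu(struct ccupdate_sim *u, int cpu)
{
    struct cpucache_sim *old = live[cpu];
    live[cpu] = u->new[cpu];
    u->new[cpu] = old;      /* caller will free this later */
}

int main(void)
{
    struct ccupdate_sim upd = { { NULL } };

    /* prepare new cpucaches, as the kmalloc() loop does */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        upd.new[cpu] = malloc(sizeof(struct cpucache_sim));
        upd.new[cpu]->limit = 124;
        upd.new[cpu]->avail = 0;
    }

    /* "smp_call_function_all_cpus": every CPU swaps in its new data */
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        swap_cpu(&upd, cpu);

    /* upd.new[] now holds the old (here NULL) cpucaches to clean up */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        printf("cpu %d: live limit=%d, old=%p\n",
               cpu, live[cpu]->limit, (void *)upd.new[cpu]);
        free(upd.new[cpu]);   /* free(NULL) is a no-op */
        free(live[cpu]);
    }
    return 0;
}

The design choice worth noting is that each CPU touches only its own slot, so no lock is needed for the swap itself; only the later freeing of the old data needs interrupts disabled.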
H.5.2 Updating Per-CPU Information
H.5.2.1  Function: smp_call_function_all_cpus()    (mm/slab.c)

This calls the function func() for all CPUs. In the context of the slab allocator, the function is do_ccupdate_local() and the argument is ccupdate_struct_t.
859 static void smp_call_function_all_cpus(void (*func) (void *arg),
                                           void *arg)
860 {
861     local_irq_disable();
862     func(arg);
863     local_irq_enable();
864
865     if (smp_call_function(func, arg, 1, 1))
866         BUG();
867 }
861-863 Disable interrupts locally and call the function for this CPU
865 For all other CPUs, call the function. smp_call_function() is an architecture specific function and will not be discussed further here
H.5.2.2  Function: do_ccupdate_local()    (mm/slab.c)

This function swaps the cpucache information in the cache descriptor with the information in info for this CPU.
874 static void do_ccupdate_local(void *info)
875 {
876     ccupdate_struct_t *new = (ccupdate_struct_t *)info;
877     cpucache_t *old = cc_data(new->cachep);
878
879     cc_data(new->cachep) = new->new[smp_processor_id()];
880     new->new[smp_processor_id()] = old;
881 }
876 info is a pointer to the ccupdate_struct_t which was passed to smp_call_function_all_cpus()(See Section H.5.2.1)
877 Part of the ccupdate_struct_t is a pointer to the cache this cpucache belongs to. cc_data() returns the cpucache_t for this processor
879 Place the new cpucache in the cache descriptor. cc_data() returns the pointer to the cpucache for this CPU
880 Replace the pointer in new with the old cpucache so it can be deleted later by the caller of smp_call_function_all_cpus(), kmem_tune_cpucache() for example
H.5.3 Draining a Per-CPU Cache
This function is called to drain all objects in a per-cpu cache. It is called when a
cache needs to be shrunk for the freeing up of slabs. A slab would not be freeable if
an object was in the per-cpu cache even though it is not in use.
H.5.3.1  Function: drain_cpu_caches()    (mm/slab.c)
885 static void drain_cpu_caches(kmem_cache_t *cachep)
886 {
887     ccupdate_struct_t new;
888     int i;
889
890     memset(&new.new,0,sizeof(new.new));
891
892     new.cachep = cachep;
893
894     down(&cache_chain_sem);
895     smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
896
897     for (i = 0; i < smp_num_cpus; i++) {
898         cpucache_t* ccold = new.new[cpu_logical_map(i)];
899         if (!ccold || (ccold->avail == 0))
900             continue;
901         local_irq_disable();
902         free_block(cachep, cc_entry(ccold), ccold->avail);
903         local_irq_enable();
904         ccold->avail = 0;
905     }
906     smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
907     up(&cache_chain_sem);
908 }
890 Blank the update structure as it is going to be clearing all data
892 Set new.cachep to cachep so that smp_call_function_all_cpus() knows what cache it is affecting
894 Acquire the cache descriptor semaphore
895 do_ccupdate_local()(See Section H.5.2.2) swaps the cpucache_t information in the cache descriptor with the ones in new so they can be altered here
897-905 For each CPU in the system ....
898 Get the cpucache descriptor for this CPU
899 If the structure does not exist for some reason or there are no objects available in it, move to the next CPU
901 Disable interrupts on this processor. It is possible an allocation from an interrupt handler elsewhere would try to access the per-CPU cache
902 Free the block of objects with free_block() (See Section H.3.3.5)
903 Re-enable interrupts
904 Show that no objects are available
906 The information for each CPU has been updated so call do_ccupdate_local() (See Section H.5.2.2) for each CPU to put the information back into the cache descriptor
907 Release the semaphore for the cache chain
H.6 Slab Allocator Initialisation
Contents
H.6 Slab Allocator Initialisation
  H.6.0.2 Function: kmem_cache_init()

H.6.0.2  Function: kmem_cache_init()    (mm/slab.c)
This function will:

• Initialise the cache chain linked list
• Initialise a mutex for accessing the cache chain
• Calculate the cache_cache colour
416 void __init kmem_cache_init(void)
417 {
418     size_t left_over;
419
420     init_MUTEX(&cache_chain_sem);
421     INIT_LIST_HEAD(&cache_chain);
422
423     kmem_cache_estimate(0, cache_cache.objsize, 0,
424             &left_over, &cache_cache.num);
425     if (!cache_cache.num)
426         BUG();
427
428     cache_cache.colour = left_over/cache_cache.colour_off;
429     cache_cache.colour_next = 0;
430 }
420 Initialise the semaphore for accessing the cache chain
421 Initialise the cache chain linked list
423 kmem_cache_estimate()(See Section H.1.2.1) calculates the number of objects and amount of bytes wasted
425 If even one kmem_cache_t cannot be stored in a page, there is something seriously wrong
428 colour is the number of different cache lines that can be used while still keeping L1 cache alignment
429 colour_next indicates which line to use next. Start at 0
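The colour arithmetic on line 428 is a simple division: the space left over in a slab after the objects are packed is divided by the cache line offset to give the number of distinct starting offsets (colours) that can be cycled through. A small worked example with purely illustrative numbers:

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: a 4096-byte slab holding 9 objects of
     * 440 bytes leaves 136 bytes over; with a 32-byte colour offset that
     * allows 4 distinct colours (offsets 0, 32, 64 and 96 bytes). */
    unsigned int slab_size  = 4096;
    unsigned int objsize    = 440;
    unsigned int num        = slab_size / objsize;        /* 9 objects */
    unsigned int left_over  = slab_size - num * objsize;  /* 136 bytes */
    unsigned int colour_off = 32;                         /* L1 line size */
    unsigned int colour     = left_over / colour_off;     /* 4 colours */

    printf("num=%u left_over=%u colour=%u\n", num, left_over, colour);
    return 0;
}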
H.7 Interfacing with the Buddy Allocator
Contents
H.7 Interfacing with the Buddy Allocator
  H.7.0.3 Function: kmem_getpages()
  H.7.0.4 Function: kmem_freepages()

H.7.0.3  Function: kmem_getpages()    (mm/slab.c)

This allocates pages for the slab allocator.
486 static inline void * kmem_getpages (kmem_cache_t *cachep,
                                        unsigned long flags)
487 {
488     void *addr;
495     flags |= cachep->gfpflags;
496     addr = (void*) __get_free_pages(flags, cachep->gfporder);
503     return addr;
504 }
495 Whatever flags were requested for the allocation, append the cache flags to it. The only flag it may append is ZONE_DMA if the cache requires DMA memory
496 Allocate from the buddy allocator with __get_free_pages() (See Section F.2.3)
503 Return the pages or NULL if it failed
H.7.0.4  Function: kmem_freepages()    (mm/slab.c)

This frees pages for the slab allocator. Before it calls the buddy allocator API, it will remove the PG_slab bit from the page flags.
507 static inline void kmem_freepages (kmem_cache_t *cachep, void *addr)
508 {
509     unsigned long i = (1<<cachep->gfporder);
510     struct page *page = virt_to_page(addr);
511
517     while (i--) {
518         PageClearSlab(page);
519         page++;
520     }
521     free_pages((unsigned long)addr, cachep->gfporder);
522 }
509 Retrieve the order used for the original allocation
510 Get the struct page for the address
517-520 Clear the PG_slab bit on each page
521 Free the pages to the buddy allocator with free_pages() (See Section F.4.1)
Appendix I

High Memory Management
Contents
I.1 Mapping High Memory Pages
  I.1.0.5 Function: kmap()
  I.1.0.6 Function: kmap_nonblock()
  I.1.1 Function: __kmap()
  I.1.2 Function: kmap_high()
  I.1.3 Function: map_new_virtual()
  I.1.4 Function: flush_all_zero_pkmaps()
I.2 Mapping High Memory Pages Atomically
  I.2.1 Function: kmap_atomic()
I.3 Unmapping Pages
  I.3.1 Function: kunmap()
  I.3.2 Function: kunmap_high()
I.4 Unmapping High Memory Pages Atomically
  I.4.1 Function: kunmap_atomic()
I.5 Bounce Buffers
  I.5.1 Creating Bounce Buffers
    I.5.1.1 Function: create_bounce()
    I.5.1.2 Function: alloc_bounce_bh()
    I.5.1.3 Function: alloc_bounce_page()
  I.5.2 Copying via Bounce Buffers
    I.5.2.1 Function: bounce_end_io_write()
    I.5.2.2 Function: bounce_end_io_read()
    I.5.2.3 Function: copy_from_high_bh()
    I.5.2.4 Function: copy_to_high_bh_irq()
    I.5.2.5 Function: bounce_end_io()
I.6 Emergency Pools
  I.6.1 Function: init_emergency_pool()
I.1 Mapping High Memory Pages
Contents
I.1 Mapping High Memory Pages
  I.1.0.5 Function: kmap()
  I.1.0.6 Function: kmap_nonblock()
  I.1.1 Function: __kmap()
  I.1.2 Function: kmap_high()
  I.1.3 Function: map_new_virtual()
  I.1.4 Function: flush_all_zero_pkmaps()

I.1.0.5  Function: kmap()    (include/asm-i386/highmem.c)
This API is used by callers willing to block.
62 #define kmap(page) __kmap(page, 0)
62 The core function __kmap() is called with the second parameter indicating that
the caller is willing to block
I.1.0.6  Function: kmap_nonblock()    (include/asm-i386/highmem.c)
63 #define kmap_nonblock(page) __kmap(page, 1)
63 The core function __kmap() is called with the second parameter indicating that the caller is not willing to block
I.1.1  Function: __kmap()    (include/asm-i386/highmem.h)
The call graph for this function is shown in Figure 9.1.
65 static inline void *kmap(struct page *page, int nonblocking)
66 {
67     if (in_interrupt())
68         out_of_line_bug();
69     if (page < highmem_start_page)
70         return page_address(page);
71     return kmap_high(page);
72 }
67-68 This function may not be used from interrupt as it may sleep. Instead of BUG(), out_of_line_bug() calls do_exit() and returns an error code. BUG() is not used because BUG() kills the process with extreme prejudice which would result in the fabled Aiee, killing interrupt handler! kernel panic
69-70 If the page is already in low memory, return a direct mapping
71 Call kmap_high()(See Section I.1.2) for the beginning of the architecture independent work
I.1.2  Function: kmap_high()    (mm/highmem.c)

132 void *kmap_high(struct page *page, int nonblocking)
133 {
134     unsigned long vaddr;
135
142     spin_lock(&kmap_lock);
143     vaddr = (unsigned long) page->virtual;
144     if (!vaddr) {
145         vaddr = map_new_virtual(page, nonblocking);
146         if (!vaddr)
147             goto out;
148     }
149     pkmap_count[PKMAP_NR(vaddr)]++;
150     if (pkmap_count[PKMAP_NR(vaddr)] < 2)
151         BUG();
152 out:
153     spin_unlock(&kmap_lock);
154     return (void*) vaddr;
155 }
142 The kmap_lock protects the virtual field of a page and the pkmap_count array
143 Get the virtual address of the page
144-148 If it is not already mapped, call map_new_virtual() which will map the page and return the virtual address. If it fails, goto out to free the spinlock and return NULL
149 Increase the reference count for this page mapping
150-151 If the count is currently less than 2, it is a serious bug. In reality, severe breakage would have to be introduced to cause this to happen
153 Free the kmap_lock
I.1.3  Function: map_new_virtual()    (mm/highmem.c)

This function is divided into three principal parts: scanning for a free slot, waiting on a queue if none is available and mapping the page.
 80 static inline unsigned long map_new_virtual(struct page *page)
 81 {
 82     unsigned long vaddr;
 83     int count;
 84
 85 start:
 86     count = LAST_PKMAP;
 87     /* Find an empty entry */
 88     for (;;) {
 89         last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
 90         if (!last_pkmap_nr) {
 91             flush_all_zero_pkmaps();
 92             count = LAST_PKMAP;
 93         }
 94         if (!pkmap_count[last_pkmap_nr])
 95             break; /* Found a usable entry */
 96         if (--count)
 97             continue;
 98
 99         if (nonblocking)
100             return 0;
86 Start scanning at the last possible slot
88-119 This loop keeps scanning and waiting until a slot becomes free. This allows the possibility of an infinite loop for some processes if they were unlucky
89 last_pkmap_nr is the last pkmap that was scanned. To prevent searching over the same pages, this value is recorded so the list is searched circularly. When it reaches LAST_PKMAP, it wraps around to 0
90-93 When last_pkmap_nr wraps around, call flush_all_zero_pkmaps() (See Section I.1.4) which will set all entries from 1 to 0 in the pkmap_count array before flushing the TLB. Count is set back to LAST_PKMAP to restart scanning
94-95 If this element is 0, a usable slot has been found for the page
96-97 Move to the next index to scan
99-100 The next block of code is going to sleep waiting for a slot to be free. If the caller requested that the function not block, return now
105     {
106         DECLARE_WAITQUEUE(wait, current);
107
108         current->state = TASK_UNINTERRUPTIBLE;
109         add_wait_queue(&pkmap_map_wait, &wait);
110         spin_unlock(&kmap_lock);
111         schedule();
112         remove_wait_queue(&pkmap_map_wait, &wait);
113         spin_lock(&kmap_lock);
114
115         /* Somebody else might have mapped it while we
                slept */
116         if (page->virtual)
117             return (unsigned long) page->virtual;
118
119         /* Re-start */
120         goto start;
121     }
122 }
If there is no available slot after scanning all the pages once, we sleep on the pkmap_map_wait queue until we are woken up after an unmap.

106 Declare the wait queue
108 Set the task state to TASK_UNINTERRUPTIBLE as we are sleeping in kernel space
109 Add ourselves to the pkmap_map_wait queue
110 Free the kmap_lock spinlock
111 Call schedule() which will put us to sleep. We are woken up after a slot becomes free after an unmap
112 Remove ourselves from the wait queue
113 Re-acquire kmap_lock
116-117 If someone else mapped the page while we slept, just return the address and the reference count will be incremented by kmap_high()
120 Restart the scanning
123     vaddr = PKMAP_ADDR(last_pkmap_nr);
124     set_pte(&(pkmap_page_table[last_pkmap_nr]), mk_pte(page,
                 kmap_prot));
125
126     pkmap_count[last_pkmap_nr] = 1;
127     page->virtual = (void *) vaddr;
128
129     return vaddr;
130 }
A slot has been found, map the page
123 Get the virtual address for the slot found
124 Make the PTE entry with the page and required protection and place it in the
page tables at the found slot
126 Initialise the value in the pkmap_count array to 1. The count is incremented in the parent function and we are sure this is the first mapping if we are in this function in the first place
127 Set the virtual field for the page
129 Return the virtual address
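Stripped of the locking and sleeping, the scan in map_new_virtual() is a bounded circular search over a small counter array, with a flush hook whenever the index wraps. The userspace sketch below (invented names pkmap_count, find_slot, flush_unused, and a much smaller table) shows that control flow; it returns -1 where the kernel would either sleep or, in the non-blocking case, give up.

#include <stdio.h>

#define LAST_PKMAP      8
#define LAST_PKMAP_MASK (LAST_PKMAP - 1)

static int pkmap_count[LAST_PKMAP];
static int last_pkmap_nr;

/* Stand-in for flush_all_zero_pkmaps(): entries with count 1
 * (mapped but no longer used) become reusable again */
static void flush_unused(void)
{
    for (int i = 0; i < LAST_PKMAP; i++)
        if (pkmap_count[i] == 1)
            pkmap_count[i] = 0;
}

/* Circular scan for a slot with count 0; -1 means none found this pass */
static int find_slot(void)
{
    int count = LAST_PKMAP;

    for (;;) {
        last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
        if (!last_pkmap_nr) {       /* wrapped: reclaim stale entries */
            flush_unused();
            count = LAST_PKMAP;
        }
        if (!pkmap_count[last_pkmap_nr]) {
            pkmap_count[last_pkmap_nr] = 2;   /* claimed and in use */
            return last_pkmap_nr;
        }
        if (!--count)
            return -1;              /* kernel would sleep or fail here */
    }
}

int main(void)
{
    for (int i = 0; i < LAST_PKMAP + 2; i++)
        printf("mapping %2d -> slot %d\n", i, find_slot());
    return 0;
}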
I.1.4  Function: flush_all_zero_pkmaps()    (mm/highmem.c)

This function cycles through the pkmap_count array and sets all entries from 1 to 0 before flushing the TLB.
42 static void flush_all_zero_pkmaps(void)
43 {
44     int i;
45
46     flush_cache_all();
47
48     for (i = 0; i < LAST_PKMAP; i++) {
49         struct page *page;
50
57         if (pkmap_count[i] != 1)
58             continue;
59         pkmap_count[i] = 0;
60
61         /* sanity check */
62         if (pte_none(pkmap_page_table[i]))
63             BUG();
64
72         page = pte_page(pkmap_page_table[i]);
73         pte_clear(&pkmap_page_table[i]);
74
75         page->virtual = NULL;
76     }
77     flush_tlb_all();
78 }
46 As the global page tables are about to change, the CPU caches of all processors
have to be ushed
48-76 Cycle through the entire pkmap_count array
57-58 If the element is not 1, move to the next element
59 Set from 1 to 0
62-63 Make sure the PTE is not somehow mapped
72-73 Unmap the page from the PTE and clear the PTE
75 Update the virtual eld as the page is unmapped
77 Flush the TLB
I.2 Mapping High Memory Pages Atomically
Contents
I.2 Mapping High Memory Pages Atomically
  I.2.1 Function: kmap_atomic()

The following is an example km_type enumeration for the x86. It lists the different uses interrupts have for atomically calling kmap. Note how KM_TYPE_NR is the last element so it doubles up as a count of the number of elements.
 4 enum km_type {
 5     KM_BOUNCE_READ,
 6     KM_SKB_SUNRPC_DATA,
 7     KM_SKB_DATA_SOFTIRQ,
 8     KM_USER0,
 9     KM_USER1,
10     KM_BH_IRQ,
11     KM_TYPE_NR
12 };

I.2.1  Function: kmap_atomic()    (include/asm-i386/highmem.h)

This is the atomic version of kmap(). Note that at no point is a spinlock held or does it sleep. A spinlock is not required as every processor has its own reserved space.
 89 static inline void *kmap_atomic(struct page *page,
                                    enum km_type type)
 90 {
 91     enum fixed_addresses idx;
 92     unsigned long vaddr;
 93
 94     if (page < highmem_start_page)
 95         return page_address(page);
 96
 97     idx = type + KM_TYPE_NR*smp_processor_id();
 98     vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
 99 #if HIGHMEM_DEBUG
100     if (!pte_none(*(kmap_pte-idx)))
101         out_of_line_bug();
102 #endif
103     set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
104     __flush_tlb_one(vaddr);
105
106     return (void*) vaddr;
107 }
89 The parameters are the page to map and the type of usage required. One slot per usage per processor is maintained
94-95 If the page is in low memory, return a direct mapping
97 type gives which slot to use. KM_TYPE_NR * smp_processor_id() gives the set of slots reserved for this processor
98 Get the virtual address
100-101 Debugging code. In reality a PTE will always exist
103 Set the PTE into the reserved slot
104 Flush the TLB for this slot
106 Return the virtual address
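The slot arithmetic on line 97 guarantees that every (CPU, usage type) pair gets its own fixed virtual address, which is why no locking is needed. A tiny sketch of that indexing follows; the constants and the direction of the address arithmetic are simplified illustrative assumptions, not the real fixmap layout.

#include <stdio.h>

enum km_type { KM_BOUNCE_READ, KM_USER0, KM_USER1, KM_TYPE_NR };

#define NR_CPUS          4
#define FIX_KMAP_BEGIN   0xFFC00000UL   /* illustrative base address */
#define PAGE_SIZE        4096UL

/* Each CPU owns KM_TYPE_NR consecutive slots; type selects one of them */
static unsigned long kmap_atomic_vaddr(enum km_type type, int cpu)
{
    unsigned int idx = type + KM_TYPE_NR * cpu;
    return FIX_KMAP_BEGIN + idx * PAGE_SIZE;
}

int main(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        printf("cpu %d KM_USER0 slot -> %#lx\n",
               cpu, kmap_atomic_vaddr(KM_USER0, cpu));
    return 0;
}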
I.3 Unmapping Pages
Contents
I.3 Unmapping Pages
  I.3.1 Function: kunmap()
  I.3.2 Function: kunmap_high()

I.3.1  Function: kunmap()    (include/asm-i386/highmem.h)
74 static inline void kunmap(struct page *page)
75 {
76     if (in_interrupt())
77         out_of_line_bug();
78     if (page < highmem_start_page)
79         return;
80     kunmap_high(page);
81 }
76-77 kunmap() cannot be called from interrupt so exit gracefully
78-79 If the page already is in low memory, there is no need to unmap
80 Call the architecture independent function kunmap_high()
I.3.2  Function: kunmap_high()    (mm/highmem.c)

This is the architecture independent part of the kunmap() operation.
157 void kunmap_high(struct page *page)
158 {
159     unsigned long vaddr;
160     unsigned long nr;
161     int need_wakeup;
162
163     spin_lock(&kmap_lock);
164     vaddr = (unsigned long) page->virtual;
165     if (!vaddr)
166         BUG();
167     nr = PKMAP_NR(vaddr);
168
173     need_wakeup = 0;
174     switch (--pkmap_count[nr]) {
175     case 0:
176         BUG();
177     case 1:
188         need_wakeup = waitqueue_active(&pkmap_map_wait);
189     }
190     spin_unlock(&kmap_lock);
191
192     /* do wake-up, if needed, race-free outside of the spin lock */
193     if (need_wakeup)
194         wake_up(&pkmap_map_wait);
195 }
163 Acquire kmap_lock protecting the virtual field and the pkmap_count array
164 Get the virtual address of the page
165-166 If the virtual field is not set, it is a double unmapping or unmapping of a non-mapped page so BUG()
167 Get the index within the pkmap_count array
173 By default, a wakeup call to processes calling kmap() is not needed
174 Check the value of the index after decrement
175-176 Falling to 0 is a bug as the TLB needs to be flushed to make 0 a valid entry
177-188 If it has dropped to 1 (the entry is now free but needs a TLB flush), check to see if there is anyone sleeping on the pkmap_map_wait queue. If necessary, the queue will be woken up after the spinlock is freed
190 Free kmap_lock
193-194 If there are waiters on the queue and a slot has been freed, wake them up
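Taken together, kmap_high() and kunmap_high() maintain a small state machine per pkmap slot: 0 means free, 1 means unused but still mapped (a TLB flush is needed before reuse) and 2 or more means in use. The sketch below (userspace C with invented map()/unmap() helpers) walks one slot through those states.

#include <stdio.h>
#include <assert.h>

/* pkmap_count semantics for one slot:
 *   0  -> slot free
 *   1  -> no users, but the PTE is still present (needs a TLB flush)
 *   >1 -> (count - 1) active users of the mapping
 */
static int count;

static void map(void)    /* like kmap_high() */
{
    if (count == 0)
        count = 1;       /* map_new_virtual() would install the PTE here */
    count++;             /* take a reference */
    assert(count >= 2);
}

static void unmap(void)  /* like kunmap_high() */
{
    assert(count >= 2);  /* dropping straight to 0 would be the BUG() case */
    count--;
    if (count == 1)
        printf("slot unused, flush pending before reuse\n");
}

int main(void)
{
    map();               /* count: 0 -> 2 */
    map();               /* second user: 2 -> 3 */
    printf("count after two maps: %d\n", count);
    unmap();             /* 3 -> 2 */
    unmap();             /* 2 -> 1, only a TLB flush stands between it and reuse */
    printf("count after two unmaps: %d\n", count);
    return 0;
}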
I.4 Unmapping High Memory Pages Atomically
Contents
I.4 Unmapping High Memory Pages Atomically
  I.4.1 Function: kunmap_atomic()

I.4.1  Function: kunmap_atomic()    (include/asm-i386/highmem.h)
This entire function is debug code. The reason is that as pages are only mapped
here atomically, they will only be used in a tiny place for a short time before being
unmapped.
It is safe to leave the page there as it will not be referenced after
unmapping and another mapping to the same slot will simply replace it.
109 static inline void kunmap_atomic(void *kvaddr, enum km_type type)
110 {
111 #if HIGHMEM_DEBUG
112     unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
113     enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id();
114
115     if (vaddr < FIXADDR_START) // FIXME
116         return;
117
118     if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx))
119         out_of_line_bug();
120
121     /*
122      * force other mappings to Oops if they'll try to access
123      * this pte without first remap it
124      */
125     pte_clear(kmap_pte-idx);
126     __flush_tlb_one(vaddr);
127 #endif
128 }
112 Get the virtual address and ensure it is aligned to a page boundary
115-116 If the address supplied is not in the fixed area, return
118-119 If the address does not correspond to the reserved slot for this type of usage and processor, declare it Oops
125-126 Unmap the page now so that if it is referenced again, it will cause an Oops
I.5 Bounce Buffers
Contents
I.5 Bounce Buffers
  I.5.1 Creating Bounce Buffers
    I.5.1.1 Function: create_bounce()
    I.5.1.2 Function: alloc_bounce_bh()
    I.5.1.3 Function: alloc_bounce_page()
  I.5.2 Copying via Bounce Buffers
    I.5.2.1 Function: bounce_end_io_write()
    I.5.2.2 Function: bounce_end_io_read()
    I.5.2.3 Function: copy_from_high_bh()
    I.5.2.4 Function: copy_to_high_bh_irq()
    I.5.2.5 Function: bounce_end_io()
I.5.1 Creating Bounce Buffers
I.5.1.1  Function: create_bounce()    (mm/highmem.c)

The call graph for this function is shown in Figure 9.3. This is the high level function for the creation of bounce buffers. It is broken into two major parts, the allocation of the necessary resources and the copying of data from the template.
405 struct buffer_head * create_bounce(int rw,
                                       struct buffer_head * bh_orig)
406 {
407     struct page *page;
408     struct buffer_head *bh;
409
410     if (!PageHighMem(bh_orig->b_page))
411         return bh_orig;
412
413     bh = alloc_bounce_bh();
420     page = alloc_bounce_page();
421
422     set_bh_page(bh, page, 0);
423
405 The parameters of the function are:
rw is set to 1 if this is a write buffer
bh_orig is the template buffer head to copy from
410-411 If the template buffer head is already in low memory, simply return it
413 Allocate a buffer head from the slab allocator or from the emergency pool if it fails
420 Allocate a page from the buddy allocator or the emergency pool if it fails
422 Associate the allocated page with the allocated buffer_head
424     bh->b_next = NULL;
425     bh->b_blocknr = bh_orig->b_blocknr;
426     bh->b_size = bh_orig->b_size;
427     bh->b_list = -1;
428     bh->b_dev = bh_orig->b_dev;
429     bh->b_count = bh_orig->b_count;
430     bh->b_rdev = bh_orig->b_rdev;
431     bh->b_state = bh_orig->b_state;
432 #ifdef HIGHMEM_DEBUG
433     bh->b_flushtime = jiffies;
434     bh->b_next_free = NULL;
435     bh->b_prev_free = NULL;
436     /* bh->b_this_page */
437     bh->b_reqnext = NULL;
438     bh->b_pprev = NULL;
439 #endif
440     /* bh->b_page */
441     if (rw == WRITE) {
442         bh->b_end_io = bounce_end_io_write;
443         copy_from_high_bh(bh, bh_orig);
444     } else
445         bh->b_end_io = bounce_end_io_read;
446     bh->b_private = (void *)bh_orig;
447     bh->b_rsector = bh_orig->b_rsector;
448 #ifdef HIGHMEM_DEBUG
449     memset(&bh->b_wait, -1, sizeof(bh->b_wait));
450 #endif
451
452     return bh;
453 }
Populate the newly created buffer_head

431 Copy in information essentially verbatim except for the b_list field as this buffer is not directly connected to the others on the list
433-438 Debugging only information
441-444 If this is a buffer that is to be written to then the callback function to end the IO is bounce_end_io_write()(See Section I.5.2.1) which is called when the device has received all the information. As the data exists in high memory, it is copied down with copy_from_high_bh() (See Section I.5.2.3)
444-445 If we are waiting for a device to write data into the buffer, then the callback function bounce_end_io_read()(See Section I.5.2.2) is used
446-447 Copy the remaining information from the template buffer_head
452 Return the new bounce buffer
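The essence of a bounce buffer is that a low-memory copy stands in for a high-memory buffer during the IO, with data copied in the appropriate direction and the original completion callback invoked afterwards. The outline below is an illustrative userspace model of a write through a bounce buffer, not the kernel code; bounce_write(), device_write() and orig_end_io() are invented names.

#include <stdio.h>
#include <string.h>

#define BUF_SIZE 16

/* Stub "device" that can only address the low-memory (bounce) buffer */
static void device_write(const char *low_buf, size_t len)
{
    printf("device wrote %zu bytes: %.*s\n", len, (int)len, low_buf);
}

/* Model of the write path: copy the high-memory data down into a bounce
 * buffer (as copy_from_high_bh() does), submit that, then run the
 * completion callback for the original buffer (as bounce_end_io() does). */
static void bounce_write(const char *high_buf, size_t len,
                         void (*end_io)(int uptodate))
{
    char bounce[BUF_SIZE];

    memcpy(bounce, high_buf, len);   /* data "bounced" into low memory */
    device_write(bounce, len);
    end_io(1);                       /* tell the original owner IO is done */
}

static void orig_end_io(int uptodate)
{
    printf("original buffer_head completion, uptodate=%d\n", uptodate);
}

int main(void)
{
    const char high_mem_data[BUF_SIZE] = "high mem data";
    bounce_write(high_mem_data, strlen(high_mem_data), orig_end_io);
    return 0;
}

For a read the copy runs in the other direction after the device completes, which is why bounce_end_io_read() below calls copy_to_high_bh_irq() before reclaiming the resources.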
I.5.1.2  Function: alloc_bounce_bh()    (mm/highmem.c)

This function first tries to allocate a buffer_head from the slab allocator and if that fails, an emergency pool will be used.
369 struct buffer_head *alloc_bounce_bh (void)
370 {
371     struct list_head *tmp;
372     struct buffer_head *bh;
373
374     bh = kmem_cache_alloc(bh_cachep, SLAB_NOHIGHIO);
375     if (bh)
376         return bh;
380
381     wakeup_bdflush();
374 Try to allocate a new buffer_head from the slab allocator. Note how the request is made to not use IO operations that involve high IO to avoid recursion
375-376 If the allocation was successful, return
381 If it was not, wake up bdflush to launder pages
383 repeat_alloc:
387     tmp = &emergency_bhs;
388     spin_lock_irq(&emergency_lock);
389     if (!list_empty(tmp)) {
390         bh = list_entry(tmp->next, struct buffer_head,
                            b_inode_buffers);
391         list_del(tmp->next);
392         nr_emergency_bhs--;
393     }
394     spin_unlock_irq(&emergency_lock);
395     if (bh)
396         return bh;
397
398     /* we need to wait I/O completion */
399     run_task_queue(&tq_disk);
400
401     yield();
402     goto repeat_alloc;
403 }
The allocation from the slab failed so allocate from the emergency pool.

387 Get the end of the emergency buffer head list
388 Acquire the lock protecting the pools
389-393 If the pool is not empty, take a buffer_head from the list and decrement the nr_emergency_bhs counter
394 Release the lock
395-396 If the allocation was successful, return it
399 If not, we are seriously short of memory and the only way the pool will replenish is if high memory IO completes. Therefore, requests on tq_disk are started so the data will be written to disk, probably freeing up pages in the process
401 Yield the processor
402 Attempt to allocate from the emergency pools again
I.5.1.3  Function: alloc_bounce_page()    (mm/highmem.c)

This function is essentially identical to alloc_bounce_bh(). It first tries to allocate a page from the buddy allocator and if that fails, an emergency pool will be used.
333 struct page *alloc_bounce_page (void)
334 {
335     struct list_head *tmp;
336     struct page *page;
337
338     page = alloc_page(GFP_NOHIGHIO);
339     if (page)
340         return page;
344
345     wakeup_bdflush();
338-340 Allocate from the buddy allocator and return the page if successful
345 Wake bdflush to launder pages
347 repeat_alloc:
351     tmp = &emergency_pages;
352     spin_lock_irq(&emergency_lock);
353     if (!list_empty(tmp)) {
354         page = list_entry(tmp->next, struct page, list);
355         list_del(tmp->next);
356         nr_emergency_pages--;
357     }
358     spin_unlock_irq(&emergency_lock);
359     if (page)
360         return page;
361
362     /* we need to wait I/O completion */
363     run_task_queue(&tq_disk);
364
365     yield();
366     goto repeat_alloc;
367 }
351 Get the end of the emergency page list
352 Acquire the lock protecting the pools
353-357 If the pool is not empty, take a page from the list and decrement the number of available pages, nr_emergency_pages
358 Release the lock
359-360 If the allocation was successful, return it
363 Run the IO task queue to try and replenish the emergency pool
365 Yield the processor
366 Attempt to allocate from the emergency pools again
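Both alloc_bounce_bh() and alloc_bounce_page() follow the same pattern: try the normal allocator, fall back to a small reserved pool, and if even that is empty, push IO along and retry. A compressed illustrative sketch of that control flow follows (userspace C, invented names, and with the retry loop bounded purely so the example terminates).

#include <stdio.h>
#include <stdlib.h>

#define POOL_SIZE 4

static void *pool[POOL_SIZE];
static int   pool_count;

/* Stand-in for the normal allocator; fails when told to simulate pressure */
static void *normal_alloc(int under_pressure)
{
    return under_pressure ? NULL : malloc(32);
}

/* Mirrors the alloc_bounce_*() pattern: normal path, then emergency pool,
 * then "wait for IO" and retry.  The kernel retries indefinitely; the
 * bound here is only so the sketch always finishes. */
static void *alloc_bounce(int under_pressure)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        void *obj = normal_alloc(under_pressure);
        if (obj)
            return obj;

        if (pool_count > 0)              /* take from the emergency pool */
            return pool[--pool_count];

        printf("pool empty: run IO queue, yield, retry\n");
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {    /* init_emergency_pool() analogue */
        pool[i] = malloc(32);
        pool_count++;
    }

    void *a = alloc_bounce(0);               /* normal allocation succeeds */
    void *b = alloc_bounce(1);               /* falls back to the pool */
    printf("normal=%p, from pool=%p, pool entries left=%d\n", a, b, pool_count);

    free(a);
    return 0;
}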
I.5.2 Copying via Bounce Buffers
I.5.2.1  Function: bounce_end_io_write()    (mm/highmem.c)

This function is called when a bounce buffer used for writing to a device completes IO. As the buffer is copied from high memory and to the device, there is nothing left to do except reclaim the resources.
319 static void bounce_end_io_write (struct buffer_head *bh,
                                     int uptodate)
320 {
321     bounce_end_io(bh, uptodate);
322 }
I.5.2.2  Function: bounce_end_io_read()    (mm/highmem.c)

This is called when data has been read from the device and needs to be copied to high memory. It is called from interrupt so it has to be more careful.
324 static void bounce_end_io_read (struct buffer_head *bh,
                                    int uptodate)
325 {
326     struct buffer_head *bh_orig =
              (struct buffer_head *)(bh->b_private);
327
328     if (uptodate)
329         copy_to_high_bh_irq(bh_orig, bh);
330     bounce_end_io(bh, uptodate);
331 }
328-329 The data has just been copied to the bounce buffer so it needs to be moved to high memory with copy_to_high_bh_irq() (See Section I.5.2.4)
330 Reclaim the resources
I.5.2.3  Function: copy_from_high_bh()    (mm/highmem.c)

This function copies data from a high memory buffer_head to a bounce buffer.
215 static inline void copy_from_high_bh (struct buffer_head *to,
216                                       struct buffer_head *from)
217 {
218     struct page *p_from;
219     char *vfrom;
220
221     p_from = from->b_page;
222
223     vfrom = kmap_atomic(p_from, KM_USER0);
224     memcpy(to->b_data, vfrom + bh_offset(from), to->b_size);
225     kunmap_atomic(vfrom, KM_USER0);
226 }
223 Map the high memory page into low memory. This path is protected by the IRQ safe lock io_request_lock so it is safe to call kmap_atomic() (See Section I.2.1)
224 Copy the data
225 Unmap the page
I.5.2.4  Function: copy_to_high_bh_irq()    (mm/highmem.c)

Called from interrupt after the device has finished writing data to the bounce buffer. This function copies data to high memory.
228 static inline void copy_to_high_bh_irq (struct buffer_head *to,
229                                         struct buffer_head *from)
230 {
231     struct page *p_to;
232     char *vto;
233     unsigned long flags;
234
235     p_to = to->b_page;
236     __save_flags(flags);
237     __cli();
238     vto = kmap_atomic(p_to, KM_BOUNCE_READ);
239     memcpy(vto + bh_offset(to), from->b_data, to->b_size);
240     kunmap_atomic(vto, KM_BOUNCE_READ);
241     __restore_flags(flags);
242 }
236-237 Save the flags and disable interrupts
238 Map the high memory page into low memory
239 Copy the data
240 Unmap the page
241 Restore the interrupt ags
I.5.2.5  Function: bounce_end_io()    (mm/highmem.c)

Reclaims the resources used by the bounce buffers. If the emergency pools are depleted, the resources are added to them.
244 static inline void bounce_end_io (struct buffer_head *bh,
                                      int uptodate)
245 {
246     struct page *page;
247     struct buffer_head *bh_orig =
              (struct buffer_head *)(bh->b_private);
248     unsigned long flags;
249
250     bh_orig->b_end_io(bh_orig, uptodate);
251
252     page = bh->b_page;
253
254     spin_lock_irqsave(&emergency_lock, flags);
255     if (nr_emergency_pages >= POOL_SIZE)
256         __free_page(page);
257     else {
258         /*
259          * We are abusing page->list to manage
260          * the highmem emergency pool:
261          */
262         list_add(&page->list, &emergency_pages);
263         nr_emergency_pages++;
264     }
265
266     if (nr_emergency_bhs >= POOL_SIZE) {
267 #ifdef HIGHMEM_DEBUG
268         /* Don't clobber the constructed slab cache */
269         init_waitqueue_head(&bh->b_wait);
270 #endif
271         kmem_cache_free(bh_cachep, bh);
272     } else {
273         /*
274          * Ditto in the bh case, here we abuse b_inode_buffers:
275          */
276         list_add(&bh->b_inode_buffers, &emergency_bhs);
277         nr_emergency_bhs++;
278     }
279     spin_unlock_irqrestore(&emergency_lock, flags);
280 }
250 Call the IO completion callback for the original buffer_head
252 Get the pointer to the buffer page to free
254 Acquire the lock to the emergency pool
255-256 If the page pool is full, just return the page to the buddy allocator
257-264 Otherwise add this page to the emergency pool
266-272 If the buffer_head pool is full, just return it to the slab allocator
272-278 Otherwise add this buffer_head to the pool
279 Release the lock
I.6 Emergency Pools
Contents
I.6 Emergency Pools
  I.6.1 Function: init_emergency_pool()

There is only one function of relevance to the emergency pools and that is the init function. It is called during system startup and then the code is deleted as it is never needed again.
I.6.1  Function: init_emergency_pool()    (mm/highmem.c)

Create a pool for emergency pages and for emergency buffer_heads.
282 static __init int init_emergency_pool(void)
283 {
284     struct sysinfo i;
285     si_meminfo(&i);
286     si_swapinfo(&i);
287
288     if (!i.totalhigh)
289         return 0;
290
291     spin_lock_irq(&emergency_lock);
292     while (nr_emergency_pages < POOL_SIZE) {
293         struct page * page = alloc_page(GFP_ATOMIC);
294         if (!page) {
295             printk("couldn't refill highmem emergency pages");
296             break;
297         }
298         list_add(&page->list, &emergency_pages);
299         nr_emergency_pages++;
300     }
288-289 If there is no high memory available, do not bother
291 Acquire the lock protecting the pools
292-300 Allocate POOL_SIZE pages from the buddy allocator and add them to a linked list. Keep a count of the number of pages in the pool with nr_emergency_pages
301     while (nr_emergency_bhs < POOL_SIZE) {
302         struct buffer_head * bh =
                  kmem_cache_alloc(bh_cachep, SLAB_ATOMIC);
303         if (!bh) {
304             printk("couldn't refill highmem emergency bhs");
305             break;
306         }
307         list_add(&bh->b_inode_buffers, &emergency_bhs);
308         nr_emergency_bhs++;
309     }
310     spin_unlock_irq(&emergency_lock);
311
312     printk("allocated %d pages and %d bhs reserved for the
                highmem bounces\n",
               nr_emergency_pages, nr_emergency_bhs);
313
314     return 0;
315 }
301-309 Allocate POOL_SIZE buffer_heads from the slab allocator and add them to a linked list linked by b_inode_buffers. Keep track of how many heads are in the pool with nr_emergency_bhs
310 Release the lock protecting the pools
314 Return success
Appendix J
Page Frame Reclamation
Contents
J.1 Page Cache Operations
  J.1.1 Adding Pages to the Page Cache
    J.1.1.1 Function: add_to_page_cache()
    J.1.1.2 Function: add_to_page_cache_unique()
    J.1.1.3 Function: __add_to_page_cache()
    J.1.1.4 Function: add_page_to_inode_queue()
    J.1.1.5 Function: add_page_to_hash_queue()
  J.1.2 Deleting Pages from the Page Cache
    J.1.2.1 Function: remove_inode_page()
    J.1.2.2 Function: __remove_inode_page()
    J.1.2.3 Function: remove_page_from_inode_queue()
    J.1.2.4 Function: remove_page_from_hash_queue()
  J.1.3 Acquiring/Releasing Page Cache Pages
    J.1.3.1 Function: page_cache_get()
    J.1.3.2 Function: page_cache_release()
  J.1.4 Searching the Page Cache
    J.1.4.1 Function: find_get_page()
    J.1.4.2 Function: __find_get_page()
    J.1.4.3 Function: __find_page_nolock()
    J.1.4.4 Function: find_lock_page()
    J.1.4.5 Function: __find_lock_page()
    J.1.4.6 Function: __find_lock_page_helper()
J.2 LRU List Operations
  J.2.1 Adding Pages to the LRU Lists
    J.2.1.1 Function: lru_cache_add()
    J.2.1.2 Function: add_page_to_active_list()
    J.2.1.3 Function: add_page_to_inactive_list()
  J.2.2 Deleting Pages from the LRU Lists
    J.2.2.1 Function: lru_cache_del()
    J.2.2.2 Function: __lru_cache_del()
    J.2.2.3 Function: del_page_from_active_list()
    J.2.2.4 Function: del_page_from_inactive_list()
  J.2.3 Activating Pages
    J.2.3.1 Function: mark_page_accessed()
    J.2.3.2 Function: activate_lock()
    J.2.3.3 Function: activate_page_nolock()
J.3 Refilling inactive_list
  J.3.1 Function: refill_inactive()
J.4 Reclaiming Pages from the LRU Lists
  J.4.1 Function: shrink_cache()
J.5 Shrinking all caches
  J.5.1 Function: shrink_caches()
  J.5.2 Function: try_to_free_pages()
  J.5.3 Function: try_to_free_pages_zone()
J.6 Swapping Out Process Pages
  J.6.1 Function: swap_out()
  J.6.2 Function: swap_out_mm()
  J.6.3 Function: swap_out_vma()
  J.6.4 Function: swap_out_pgd()
  J.6.5 Function: swap_out_pmd()
  J.6.6 Function: try_to_swap_out()
J.7 Page Swap Daemon
  J.7.1 Initialising kswapd
    J.7.1.1 Function: kswapd_init()
  J.7.2 kswapd Daemon
    J.7.2.1 Function: kswapd()
    J.7.2.2 Function: kswapd_can_sleep()
    J.7.2.3 Function: kswapd_can_sleep_pgdat()
    J.7.2.4 Function: kswapd_balance()
    J.7.2.5 Function: kswapd_balance_pgdat()
J.1 Page Cache Operations
Contents
J.1 Page Cache Operations
  J.1.1 Adding Pages to the Page Cache
    J.1.1.1 Function: add_to_page_cache()
    J.1.1.2 Function: add_to_page_cache_unique()
    J.1.1.3 Function: __add_to_page_cache()
    J.1.1.4 Function: add_page_to_inode_queue()
    J.1.1.5 Function: add_page_to_hash_queue()
  J.1.2 Deleting Pages from the Page Cache
    J.1.2.1 Function: remove_inode_page()
    J.1.2.2 Function: __remove_inode_page()
    J.1.2.3 Function: remove_page_from_inode_queue()
    J.1.2.4 Function: remove_page_from_hash_queue()
  J.1.3 Acquiring/Releasing Page Cache Pages
    J.1.3.1 Function: page_cache_get()
    J.1.3.2 Function: page_cache_release()
  J.1.4 Searching the Page Cache
    J.1.4.1 Function: find_get_page()
    J.1.4.2 Function: __find_get_page()
    J.1.4.3 Function: __find_page_nolock()
    J.1.4.4 Function: find_lock_page()
    J.1.4.5 Function: __find_lock_page()
    J.1.4.6 Function: __find_lock_page_helper()
This section addresses how pages are added and removed from the page cache
and LRU lists, both of which are heavily intertwined.
J.1.1 Adding Pages to the Page Cache
J.1.1.1  Function: add_to_page_cache()    (mm/filemap.c)

Acquire the lock protecting the page cache before calling __add_to_page_cache() which will add the page to the page hash table and inode queue which allows the pages belonging to files to be found quickly.
667 void add_to_page_cache(struct page * page,
                           struct address_space * mapping,
                           unsigned long offset)
668 {
669     spin_lock(&pagecache_lock);
670     __add_to_page_cache(page, mapping,
                            offset, page_hash(mapping, offset));
671     spin_unlock(&pagecache_lock);
672     lru_cache_add(page);
673 }
669 Acquire the lock protecting the page hash and inode queues
670 Call the function which performs the real work
671 Release the lock protecting the hash and inode queue
672 Add the page to the page cache. page_hash() hashes into the page hash table based on the mapping and the offset within the file. If a page is returned, there was a collision and the colliding pages are chained with the page→next_hash and page→pprev_hash fields
Function: add_to_page_cache_unique()
In many respects, this function is very similar to
(mm/lemap.c)
add_to_page_cache().
The
principal dierence is that this function will check the page cache with the
pagecache_lock
spinlock held before adding the page to the cache.
It is for
callers may race with another process for inserting a page in the cache such as
add_to_swap_cache()(See
Section K.2.1.1).
675 int add_to_page_cache_unique(struct page * page,
676        struct address_space *mapping, unsigned long offset,
677        struct page **hash)
678 {
679     int err;
680     struct page *alias;
681
682     spin_lock(&pagecache_lock);
683     alias = __find_page_nolock(mapping, offset, *hash);
684
685     err = 1;
686     if (!alias) {
687         __add_to_page_cache(page,mapping,offset,hash);
688         err = 0;
689     }
690
691     spin_unlock(&pagecache_lock);
692     if (!err)
693         lru_cache_add(page);
694     return err;
695 }
682 Acquire the pagecache_lock for examining the cache
683 Check if the page already exists in the cache with __find_page_nolock() (See Section J.1.4.3)
686-689 If the page does not exist in the cache, add it with __add_to_page_cache() (See Section J.1.1.3)
691 Release the pagecache_lock
692-693 If the page did not already exist in the page cache, add it to the LRU lists with lru_cache_add()(See Section J.2.1.1)
694 Return 0 if this call entered the page into the page cache and 1 if it already existed
J.1.1.3  Function: __add_to_page_cache()    (mm/filemap.c)

Clear all page flags, lock it, take a reference and add it to the inode and hash queues.
653 static inline void __add_to_page_cache(struct page * page,
654        struct address_space *mapping, unsigned long offset,
655        struct page **hash)
656 {
657     unsigned long flags;
658
659     flags = page->flags & ~(1 << PG_uptodate |
                1 << PG_error | 1 << PG_dirty |
                1 << PG_referenced | 1 << PG_arch_1 |
                1 << PG_checked);
660     page->flags = flags | (1 << PG_locked);
661     page_cache_get(page);
662     page->index = offset;
663     add_page_to_inode_queue(mapping, page);
664     add_page_to_hash_queue(page, hash);
665 }
659 Clear all page flags
660 Lock the page
661 Take a reference to the page in case it gets freed prematurely
662 Update the index so it is known what file offset this page represents
663 Add the page to the inode queue with add_page_to_inode_queue() (See Section J.1.1.4). This links the page via page→list to the clean_pages list in the address_space and points page→mapping to the same address_space
664 Add it to the page hash with add_page_to_hash_queue() (See Section J.1.1.5). The hash page was returned by page_hash() in the parent function. The page hash allows page cache pages to be found without having to linearly search the inode queue
J.1.1.4  Function: add_page_to_inode_queue()    (mm/filemap.c)
85 static inline void add_page_to_inode_queue(
        struct address_space *mapping, struct page * page)
86 {
87     struct list_head *head = &mapping->clean_pages;
88
89     mapping->nrpages++;
90     list_add(&page->list, head);
91     page->mapping = mapping;
92 }
87 When this function is called, the page is clean, so mapping→clean_pages is the list of interest
89 Increment the number of pages that belong to this mapping
90 Add the page to the clean list
91 Set the page→mapping field
J.1.1.5  Function: add_page_to_hash_queue()    (mm/filemap.c)

This adds page to the top of the hash bucket headed by p. Bear in mind that p is an element of the array page_hash_table.
71 static void add_page_to_hash_queue(struct page * page,
                                      struct page **p)
72 {
73     struct page *next = *p;
74
75     *p = page;
76     page->next_hash = next;
77     page->pprev_hash = p;
78     if (next)
79         next->pprev_hash = &page->next_hash;
80     if (page->buffers)
81         PAGE_BUG(page);
82     atomic_inc(&page_cache_size);
83 }
73 Record the current head of the hash bucket in next
75 Update the head of the hash bucket to be page
76 Point page→next_hash to the old head of the hash bucket
77 Point page→pprev_hash to point to the array element in page_hash_table
78-79 This will point the pprev_hash field to the head of the hash bucket, completing the insertion of the page into the linked list
80-81 Check that the page entered has no associated buffers
82 Increment page_cache_size which is the size of the page cache
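The pprev_hash trick is what lets a page be unlinked from its hash chain without knowing whether it is the first element: every node stores a pointer to the location that points at it (either the table slot or the previous node's next_hash). The following self-contained sketch reproduces the insert and the matching unlink; the struct node and the helper names hash_add()/hash_del() are invented for illustration.

#include <stdio.h>
#include <stddef.h>

struct node {
    int value;
    struct node  *next_hash;    /* next element in the bucket */
    struct node **pprev_hash;   /* location that points at this node */
};

/* Mirrors add_page_to_hash_queue(): push at the head of bucket *p */
static void hash_add(struct node *n, struct node **p)
{
    struct node *next = *p;

    *p = n;
    n->next_hash = next;
    n->pprev_hash = p;
    if (next)
        next->pprev_hash = &n->next_hash;
}

/* Mirrors remove_page_from_hash_queue(): unlink without walking the chain */
static void hash_del(struct node *n)
{
    struct node  *next  = n->next_hash;
    struct node **pprev = n->pprev_hash;

    if (next)
        next->pprev_hash = pprev;
    *pprev = next;
    n->pprev_hash = NULL;
}

int main(void)
{
    struct node *bucket = NULL;               /* one page_hash_table slot */
    struct node a = { .value = 1 }, b = { .value = 2 }, c = { .value = 3 };

    hash_add(&a, &bucket);
    hash_add(&b, &bucket);
    hash_add(&c, &bucket);                    /* bucket: 3 -> 2 -> 1 */
    hash_del(&b);                             /* bucket: 3 -> 1 */

    for (struct node *n = bucket; n; n = n->next_hash)
        printf("%d ", n->value);
    printf("\n");
    return 0;
}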
J.1.2 Deleting Pages from the Page Cache
J.1.2.1  Function: remove_inode_page()    (mm/filemap.c)
130 void remove_inode_page(struct page *page)
131 {
132     if (!PageLocked(page))
133         PAGE_BUG(page);
134
135     spin_lock(&pagecache_lock);
136     __remove_inode_page(page);
137     spin_unlock(&pagecache_lock);
138 }
132-133 If the page is not locked, it is a bug
135 Acquire the lock protecting the page cache
136 __remove_inode_page()
(See Section J.1.2.2) is the top-level function for
when the pagecache lock is held
137 Release the pagecache lock
J.1.2.2 Function: __remove_inode_page() (mm/filemap.c)
This is the top-level function for removing a page from the page cache for callers with the pagecache_lock spinlock held. Callers that do not have this lock acquired should call remove_inode_page().

124 void __remove_inode_page(struct page *page)
125 {
126         remove_page_from_inode_queue(page);
127         remove_page_from_hash_queue(page);
128 }

126 remove_page_from_inode_queue() (See Section J.1.2.3) removes the page from its address_space at page→mapping

127 remove_page_from_hash_queue() removes the page from the hash table in page_hash_table
J.1.2.3 Function: remove_page_from_inode_queue() (mm/filemap.c)

 94 static inline void remove_page_from_inode_queue(struct page * page)
 95 {
 96         struct address_space * mapping = page->mapping;
 97
 98         if (mapping->a_ops->removepage)
 99                 mapping->a_ops->removepage(page);
100         list_del(&page->list);
101         page->mapping = NULL;
102         wmb();
103         mapping->nrpages--;
104 }

96 Get the associated address_space for this page

98-99 Call the filesystem-specific removepage() function if one is available

100 Delete the page from whatever list it belongs to in the mapping, such as the clean_pages list in most cases or the dirty_pages list in rarer cases

101 Set page→mapping to NULL as it is no longer backed by any address_space

103 Decrement the number of pages in the mapping
J.1.2.4 Function: remove_page_from_hash_queue() (mm/filemap.c)

107 static inline void remove_page_from_hash_queue(struct page * page)
108 {
109         struct page *next = page->next_hash;
110         struct page **pprev = page->pprev_hash;
111
112         if (next)
113                 next->pprev_hash = pprev;
114         *pprev = next;
115         page->pprev_hash = NULL;
116         atomic_dec(&page_cache_size);
117 }

109 Get the next page after the page being removed

110 Get the pprev page before the page being removed. When the function completes, pprev will be linked to next

112 If this is not the end of the list, update next→pprev_hash to point to pprev

114 Similarly, point pprev forward to next. page is now unlinked

116 Decrement the size of the page cache
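The removal is the mirror of the insertion sketch in Section J.1.1.5. Continuing that hypothetical user-space analogue (struct node is the invented type from the earlier sketch), the node being unlinked never needs to know whether it was the bucket head or in the middle of the chain because pprev_hash always addresses the one pointer that must be rewritten:

```c
/* Unlink n from its bucket, mirroring remove_page_from_hash_queue().
 * struct node comes from the earlier, invented insertion sketch. */
static void hash_del(struct node *n)
{
	struct node *next = n->next_hash;
	struct node **pprev = n->pprev_hash;

	if (next)
		next->pprev_hash = pprev;  /* successor now points back past n */
	*pprev = next;                     /* predecessor (or bucket slot) skips n */
	n->pprev_hash = NULL;              /* mark n as no longer hashed */
}
```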
J.1.3 Acquiring/Releasing Page Cache Pages

J.1.3.1 Function: page_cache_get() (include/linux/pagemap.h)

31 #define page_cache_get(x)       get_page(x)

31 Simply call get_page() which uses atomic_inc() to increment the page reference count

J.1.3.2 Function: page_cache_release() (include/linux/pagemap.h)

32 #define page_cache_release(x)   __free_page(x)

32 Call __free_page() which decrements the page count. If the count reaches 0, the page will be freed
J.1.4 Searching the Page Cache

J.1.4.1 Function: find_get_page() (include/linux/pagemap.h)
Top level macro for finding a page in the page cache. It simply looks up the page hash.

75 #define find_get_page(mapping, index) \
76         __find_get_page(mapping, index, page_hash(mapping, index))

76 page_hash() locates an entry in the page_hash_table based on the address_space and offset

J.1.4.2 Function: __find_get_page() (mm/filemap.c)
This function is responsible for finding a struct page given an entry in page_hash_table as a starting point.

931 struct page * __find_get_page(struct address_space *mapping,
932                 unsigned long offset, struct page **hash)
933 {
934         struct page *page;
935
936         /*
937          * We scan the hash list read-only. Addition to and removal from
938          * the hash-list needs a held write-lock.
939          */
940         spin_lock(&pagecache_lock);
941         page = __find_page_nolock(mapping, offset, *hash);
942         if (page)
943                 page_cache_get(page);
944         spin_unlock(&pagecache_lock);
945         return page;
946 }

940 Acquire the page cache lock so the hash list may be scanned safely

941 Call the page cache traversal function which presumes a lock is held

942-943 If the page was found, obtain a reference to it with page_cache_get() (See Section J.1.3.1) so it is not freed prematurely

944 Release the page cache lock

945 Return the page or NULL if not found
J.1.4.3 Function: __find_page_nolock() (mm/filemap.c)
This function traverses the hash collision list looking for the page specified by the address_space and offset.

443 static inline struct page * __find_page_nolock(
                struct address_space *mapping,
                unsigned long offset,
                struct page *page)
444 {
445         goto inside;
446
447         for (;;) {
448                 page = page->next_hash;
449 inside:
450                 if (!page)
451                         goto not_found;
452                 if (page->mapping != mapping)
453                         continue;
454                 if (page->index == offset)
455                         break;
456         }
457
458 not_found:
459         return page;
460 }

445 Begin by examining the first page in the list

450-451 If the page is NULL, the right one could not be found so return NULL

452 If the address_space does not match, move to the next page on the collision list

454 If the offset matches, return it, else move on

448 Move to the next page on the hash list

459 Return the found page or NULL if not
J.1.4.4 Function: find_lock_page() (include/linux/pagemap.h)
This is the top level function for searching the page cache for a page and having it returned in a locked state.

84 #define find_lock_page(mapping, index) \
85         __find_lock_page(mapping, index, page_hash(mapping, index))

85 Call the core function __find_lock_page() after looking up what hash bucket this page is using with page_hash()

J.1.4.5 Function: __find_lock_page() (mm/filemap.c)
This function acquires the pagecache_lock spinlock before calling the core function __find_lock_page_helper() to locate the page and lock it.

1005 struct page * __find_lock_page (struct address_space *mapping,
1006                 unsigned long offset, struct page **hash)
1007 {
1008         struct page *page;
1009
1010         spin_lock(&pagecache_lock);
1011         page = __find_lock_page_helper(mapping, offset, *hash);
1012         spin_unlock(&pagecache_lock);
1013         return page;
1014 }

1010 Acquire the pagecache_lock spinlock

1011 Call __find_lock_page_helper() which will search the page cache and lock the page if it is found

1012 Release the pagecache_lock spinlock

1013 If the page was found, return it in a locked state, otherwise return NULL
J.1.4.6 Function: __find_lock_page_helper() (mm/filemap.c)
This function uses __find_page_nolock() to locate a page within the page cache. If it is found, the page will be locked for returning to the caller.

972 static struct page * __find_lock_page_helper(
                struct address_space *mapping,
973                 unsigned long offset, struct page *hash)
974 {
975         struct page *page;
976
977         /*
978          * We scan the hash list read-only. Addition to and removal from
979          * the hash-list needs a held write-lock.
980          */
981 repeat:
982         page = __find_page_nolock(mapping, offset, hash);
983         if (page) {
984                 page_cache_get(page);
985                 if (TryLockPage(page)) {
986                         spin_unlock(&pagecache_lock);
987                         lock_page(page);
988                         spin_lock(&pagecache_lock);
989
990                         /* Has the page been re-allocated while we slept? */
991                         if (page->mapping != mapping || page->index != offset) {
992                                 UnlockPage(page);
993                                 page_cache_release(page);
994                                 goto repeat;
995                         }
996                 }
997         }
998         return page;
999 }

982 Use __find_page_nolock() (See Section J.1.4.3) to locate the page in the page cache

983-984 If the page was found, take a reference to it

985 Try and lock the page with TryLockPage(). This macro is just a wrapper around test_and_set_bit() which attempts to set the PG_locked bit in page→flags

986-988 If the lock failed, release the pagecache_lock spinlock and call lock_page() (See Section B.2.1.1) to lock the page. It is likely this function will sleep until the page lock is acquired. When the page is locked, acquire the pagecache_lock spinlock again

991 If the mapping and index no longer match, it means that this page was reclaimed while we were asleep. The page is unlocked and the reference dropped before searching the page cache again

998 Return the page in a locked state, or NULL if it was not in the page cache
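The drop-the-spinlock, sleep, re-validate pattern used here appears throughout the VM. The sketch below is a hypothetical user-space analogue built with pthreads (struct entry, the toy cache array and find_lock_entry() are invented for illustration, not kernel code); it exercises the same three steps: pin the object, drop the lookup lock before sleeping on the object lock, then re-check the object's identity after waking. If the identity check fails, the reference is dropped and the whole lookup is retried, exactly as __find_lock_page_helper() does with its repeat label.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative cache entry: identity fields plus a per-entry sleeping lock */
struct entry {
	pthread_mutex_t lock;    /* analogue of the page lock (may sleep) */
	const void *owner;       /* analogue of page->mapping */
	unsigned long index;     /* analogue of page->index */
	int refcount;
};

/* A toy "cache": one global lock (analogue of pagecache_lock) and a table */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
#define CACHE_SIZE 16
static struct entry cache[CACHE_SIZE];

/* Linear lookup; the caller must hold cache_lock */
static struct entry *cache_lookup(const void *owner, unsigned long index)
{
	for (int i = 0; i < CACHE_SIZE; i++)
		if (cache[i].owner == owner && cache[i].index == index)
			return &cache[i];
	return NULL;
}

/* Find an entry and return it pinned and locked, mirroring the helper above */
struct entry *find_lock_entry(const void *owner, unsigned long index)
{
	struct entry *e;

repeat:
	pthread_mutex_lock(&cache_lock);
	e = cache_lookup(owner, index);
	if (e) {
		e->refcount++;                         /* pin it before any sleep */
		if (pthread_mutex_trylock(&e->lock) != 0) {
			/* Must not sleep on e->lock while holding cache_lock */
			pthread_mutex_unlock(&cache_lock);
			pthread_mutex_lock(&e->lock);  /* may block */
			pthread_mutex_lock(&cache_lock);

			/* While asleep, the entry may have been freed and reused */
			if (e->owner != owner || e->index != index) {
				pthread_mutex_unlock(&e->lock);
				e->refcount--;
				pthread_mutex_unlock(&cache_lock);
				goto repeat;
			}
		}
	}
	pthread_mutex_unlock(&cache_lock);
	return e;     /* locked and pinned, or NULL */
}
```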
J.2 LRU List Operations

Contents
J.2 LRU List Operations                                   535
  J.2.1 Adding Pages to the LRU Lists                     535
    J.2.1.1 Function: lru_cache_add()                     535
    J.2.1.2 Function: add_page_to_active_list()           535
    J.2.1.3 Function: add_page_to_inactive_list()         536
  J.2.2 Deleting Pages from the LRU Lists                 536
    J.2.2.1 Function: lru_cache_del()                     536
    J.2.2.2 Function: __lru_cache_del()                   537
    J.2.2.3 Function: del_page_from_active_list()         537
    J.2.2.4 Function: del_page_from_inactive_list()       537
  J.2.3 Activating Pages                                  538
    J.2.3.1 Function: mark_page_accessed()                538
    J.2.3.2 Function: activate_page()                     538
    J.2.3.3 Function: activate_page_nolock()              538

J.2.1 Adding Pages to the LRU Lists

J.2.1.1 Function: lru_cache_add() (mm/swap.c)
Adds a page to the LRU inactive_list.
58 void lru_cache_add(struct page * page)
59 {
60         if (!PageLRU(page)) {
61                 spin_lock(&pagemap_lru_lock);
62                 if (!TestSetPageLRU(page))
63                         add_page_to_inactive_list(page);
64                 spin_unlock(&pagemap_lru_lock);
65         }
66 }

60 If the page is not already part of the LRU lists, add it

61 Acquire the LRU lock

62-63 Test and set the LRU bit. If it was clear, call add_page_to_inactive_list()

64 Release the LRU lock
J.2.1.2 Function: add_page_to_active_list() (include/linux/swap.h)
Adds the page to the active_list.

178 #define add_page_to_active_list(page)           \
179 do {                                            \
180         DEBUG_LRU_PAGE(page);                   \
181         SetPageActive(page);                    \
182         list_add(&(page)->lru, &active_list);   \
183         nr_active_pages++;                      \
184 } while (0)

180 The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked as active

181 Update the flags of the page to show it is active

182 Add the page to the active_list

183 Update the count of the number of pages in the active_list
J.2.1.3 Function: add_page_to_inactive_list() (include/linux/swap.h)
Adds the page to the inactive_list.

186 #define add_page_to_inactive_list(page)         \
187 do {                                            \
188         DEBUG_LRU_PAGE(page);                   \
189         list_add(&(page)->lru, &inactive_list); \
190         nr_inactive_pages++;                    \
191 } while (0)

188 The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked as active

189 Add the page to the inactive_list

190 Update the count of the number of inactive pages on the list
J.2.2 Deleting Pages from the LRU Lists

J.2.2.1 Function: lru_cache_del() (mm/swap.c)
Acquire the lock protecting the LRU lists before calling __lru_cache_del().

90 void lru_cache_del(struct page * page)
91 {
92         spin_lock(&pagemap_lru_lock);
93         __lru_cache_del(page);
94         spin_unlock(&pagemap_lru_lock);
95 }

92 Acquire the LRU lock

93 __lru_cache_del() does the real work of removing the page from the LRU lists

94 Release the LRU lock
J.2.2.2 Function: __lru_cache_del() (mm/swap.c)
Select which function is needed to remove the page from the LRU list.

75 void __lru_cache_del(struct page * page)
76 {
77         if (TestClearPageLRU(page)) {
78                 if (PageActive(page)) {
79                         del_page_from_active_list(page);
80                 } else {
81                         del_page_from_inactive_list(page);
82                 }
83         }
84 }

77 Test and clear the flag indicating the page is in the LRU

78-82 If the page is on the LRU, select the appropriate removal function

78-79 If the page is active, then call del_page_from_active_list(), else delete it from the inactive list with del_page_from_inactive_list()

J.2.2.3 Function: del_page_from_active_list() (include/linux/swap.h)
Remove the page from the active_list.
193 #define del_page_from_active_list(page)         \
194 do {                                            \
195         list_del(&(page)->lru);                 \
196         ClearPageActive(page);                  \
197         nr_active_pages--;                      \
198 } while (0)

195 Delete the page from the list

196 Clear the flag indicating it is part of the active_list. The flag indicating it is part of the LRU list has already been cleared by __lru_cache_del()

197 Update the count of the number of pages in the active_list
J.2.2.4 Function: del_page_from_inactive_list() (include/linux/swap.h)

200 #define del_page_from_inactive_list(page)       \
201 do {                                            \
202         list_del(&(page)->lru);                 \
203         nr_inactive_pages--;                    \
204 } while (0)

202 Remove the page from the LRU list

203 Update the count of the number of pages in the inactive_list
J.2.3 Activating Pages

J.2.3.1 Function: mark_page_accessed() (mm/filemap.c)
This marks that a page has been referenced. If the page is already on the active_list or the referenced flag is clear, the referenced flag will simply be set. If it is on the inactive_list and the referenced flag has already been set, activate_page() will be called to move the page to the top of the active_list.

1332 void mark_page_accessed(struct page *page)
1333 {
1334         if (!PageActive(page) && PageReferenced(page)) {
1335                 activate_page(page);
1336                 ClearPageReferenced(page);
1337         } else
1338                 SetPageReferenced(page);
1339 }

1334-1337 If the page is on the inactive_list (!PageActive()) and has been referenced recently (PageReferenced()), activate_page() is called to move it to the active_list

1338 Otherwise, mark the page as referenced
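This gives pages a simple two-touch promotion policy: the first reference only marks the page, and a second reference while the page is still inactive promotes it. The following stand-alone model only mimics the flag logic, not the real lists or locking; struct page_state and touch() are invented for illustration.

```c
#include <stdio.h>
#include <stdbool.h>

/* Minimal model of the two LRU-related page flags involved here */
struct page_state {
	bool active;       /* set when the page sits on the active_list */
	bool referenced;   /* set on first touch, cleared on promotion */
};

/* Same decision as mark_page_accessed(): promote only on the second
 * touch of an inactive page */
static void touch(struct page_state *p)
{
	if (!p->active && p->referenced) {
		p->active = true;        /* corresponds to activate_page() */
		p->referenced = false;
	} else {
		p->referenced = true;
	}
}

int main(void)
{
	struct page_state p = { false, false };

	for (int i = 1; i <= 3; i++) {
		touch(&p);
		printf("touch %d: active=%d referenced=%d\n",
		       i, p.active, p.referenced);
	}
	/* touch 1: inactive but referenced; touch 2: promoted to active;
	 * touch 3: active and referenced again */
	return 0;
}
```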
J.2.3.2 Function: activate_page() (mm/swap.c)
Acquire the LRU lock before calling activate_page_nolock() which moves the page from the inactive_list to the active_list.

47 void activate_page(struct page * page)
48 {
49         spin_lock(&pagemap_lru_lock);
50         activate_page_nolock(page);
51         spin_unlock(&pagemap_lru_lock);
52 }

49 Acquire the LRU lock

50 Call the main work function

51 Release the LRU lock
J.2.3.3 Function: activate_page_nolock() (mm/swap.c)
Move the page from the inactive_list to the active_list.

39 static inline void activate_page_nolock(struct page * page)
40 {
41         if (PageLRU(page) && !PageActive(page)) {
42                 del_page_from_inactive_list(page);
43                 add_page_to_active_list(page);
44         }
45 }

41 Make sure the page is on the LRU and not already on the active_list

42-43 Delete the page from the inactive_list and add it to the active_list
J.3 Refilling inactive_list

Contents
J.3 Refilling inactive_list                               540
  J.3.1 Function: refill_inactive()                       540

This section covers how pages are moved from the active lists to the inactive lists.

J.3.1 Function: refill_inactive() (mm/vmscan.c)
Move nr_pages from the active_list to the inactive_list. The parameter nr_pages is calculated by shrink_caches() and is a number which tries to keep the active list two thirds the size of the page cache.
533 static void refill_inactive(int nr_pages)
534 {
535         struct list_head * entry;
536
537         spin_lock(&pagemap_lru_lock);
538         entry = active_list.prev;
539         while (nr_pages && entry != &active_list) {
540                 struct page * page;
541
542                 page = list_entry(entry, struct page, lru);
543                 entry = entry->prev;
544                 if (PageTestandClearReferenced(page)) {
545                         list_del(&page->lru);
546                         list_add(&page->lru, &active_list);
547                         continue;
548                 }
549
550                 nr_pages--;
551
552                 del_page_from_active_list(page);
553                 add_page_to_inactive_list(page);
554                 SetPageReferenced(page);
555         }
556         spin_unlock(&pagemap_lru_lock);
557 }
537 Acquire the lock protecting the LRU list

538 Take the last entry in the active_list

539-555 Keep moving pages until nr_pages have been moved or the active_list is empty

542 Get the struct page for this entry

544-548 Test and clear the referenced flag. If it has been referenced, it is moved back to the top of the active_list

550-553 Move one page from the active_list to the inactive_list

554 Mark it referenced so that if it is referenced again soon, it will be promoted back to the active_list without requiring a second reference

556 Release the lock protecting the LRU list
J.4 Reclaiming Pages from the LRU Lists

Contents
J.4 Reclaiming Pages from the LRU Lists                   542
  J.4.1 Function: shrink_cache()                          542

This section covers how a page is reclaimed once it has been selected for pageout.

J.4.1 Function: shrink_cache() (mm/vmscan.c)

338 static int shrink_cache(int nr_pages, zone_t * classzone,
                        unsigned int gfp_mask, int priority)
339 {
340         struct list_head * entry;
341         int max_scan = nr_inactive_pages / priority;
342         int max_mapped = min((nr_pages << (10 - priority)),
                        max_scan / 10);
343
344         spin_lock(&pagemap_lru_lock);
345         while (--max_scan >= 0 &&
                        (entry = inactive_list.prev) != &inactive_list) {
338 The parameters are as follows:

nr_pages The number of pages to swap out

classzone The zone we are interested in swapping pages out for. Pages not belonging to this zone are skipped

gfp_mask The gfp mask determining what actions may be taken, such as whether filesystem operations may be performed

priority The priority of the function, which starts at DEF_PRIORITY (6) and decreases to the highest priority of 1

341 The maximum number of pages to scan is the number of pages in the inactive_list divided by the priority. At lowest priority, 1/6th of the list may be scanned. At highest priority, the full list may be scanned

342 The maximum number of process-mapped pages allowed is either one tenth of the max_scan value or nr_pages * 2^(10-priority), whichever is smaller. If this many mapped pages are found, whole processes will be swapped out

344 Lock the LRU list

345 Keep scanning until max_scan pages have been scanned or the inactive_list is empty
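As a rough worked example of these two limits (all the numbers are invented), the snippet below evaluates the same expressions for the first, lowest-priority pass. As priority decreases towards 1 on later passes, both limits grow, so each pass is allowed to scan more of the list and tolerate more mapped pages before resorting to swap_out().

```c
#include <stdio.h>

/* A local min() helper for this sketch, standing in for the kernel macro */
#define min(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	int nr_inactive_pages = 12000;   /* invented example value */
	int nr_pages = 32;               /* SWAP_CLUSTER_MAX in 2.4 */
	int priority = 6;                /* DEF_PRIORITY, the first pass */

	int max_scan = nr_inactive_pages / priority;
	int max_mapped = min(nr_pages << (10 - priority), max_scan / 10);

	/* At priority 6: max_scan = 2000, max_mapped = min(512, 200) = 200 */
	printf("max_scan=%d max_mapped=%d\n", max_scan, max_mapped);
	return 0;
}
```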
346                 struct page * page;
347
348                 if (unlikely(current->need_resched)) {
349                         spin_unlock(&pagemap_lru_lock);
350                         __set_current_state(TASK_RUNNING);
351                         schedule();
352                         spin_lock(&pagemap_lru_lock);
353                         continue;
354                 }
355

348-354 Reschedule if the quanta has been used up

349 Free the LRU lock as we are about to sleep

350 Show we are still running

351 Call schedule() so another process can be context switched in

352 Re-acquire the LRU lock

353 Reiterate through the loop and take an entry from the inactive_list again. As we slept, another process could have changed what entries are on the list, which is why another entry has to be taken with the spinlock held
356                 page = list_entry(entry, struct page, lru);
357
358                 BUG_ON(!PageLRU(page));
359                 BUG_ON(PageActive(page));
360
361                 list_del(entry);
362                 list_add(entry, &inactive_list);
363
364                 /*
365                  * Zero page counts can happen because we unlink the pages
366                  * _after_ decrementing the usage count..
367                  */
368                 if (unlikely(!page_count(page)))
369                         continue;
370
371                 if (!memclass(page_zone(page), classzone))
372                         continue;
373
374                 /* Racy check to avoid trylocking when not worthwhile */
375                 if (!page->buffers && (page_count(page) != 1 || !page->mapping))
376                         goto page_mapped;
356 Get the struct page for this entry in the LRU

358-359 It is a bug if the page either belongs to the active_list or is currently marked as active

361-362 Move the page to the top of the inactive_list so that if the page is not freed, we can just continue knowing that it will simply be examined later

368-369 If the page count has already reached 0, skip over it. In __free_pages(), the page count is dropped with put_page_testzero() before __free_pages_ok() is called to free it. This leaves a window where a page with a zero count is left on the LRU before it is freed. There is a special case to trap this at the beginning of __free_pages_ok()

371-372 Skip over this page if it belongs to a zone we are not currently interested in

375-376 If the page is mapped by a process, then goto page_mapped where max_mapped is decremented and the next page examined. If max_mapped reaches 0, process pages will be swapped out
382                 if (unlikely(TryLockPage(page))) {
383                         if (PageLaunder(page) && (gfp_mask & __GFP_FS)) {
384                                 page_cache_get(page);
385                                 spin_unlock(&pagemap_lru_lock);
386                                 wait_on_page(page);
387                                 page_cache_release(page);
388                                 spin_lock(&pagemap_lru_lock);
389                         }
390                         continue;
391                 }
The page is locked and the launder bit is set. In this case, it is the second time this page has been found dirty. The first time, it was scheduled for IO and placed back on the list. This time, we wait until the IO is complete and then try to free the page.

382-383 If we could not lock the page, the PG_launder bit is set and the GFP flags allow the caller to perform FS operations, then...

384 Take a reference to the page so it does not disappear while we sleep

385 Free the LRU lock

386 Wait until the IO is complete

387 Release the reference to the page. If it reaches 0, the page will be freed

388 Re-acquire the LRU lock

390 Move to the next page
392                 if (PageDirty(page) && is_page_cache_freeable(page) &&
393                                 page->mapping) {
394                         /*
395                          * It is not critical here to write it only if
396                          * the page is unmapped beause any direct writer
397                          * like O_DIRECT would set the PG_dirty bitflag
398                          * on the phisical page after having successfully
399                          * pinned it and after the I/O to the page is finished,
400                          * so the direct writes to the page cannot get lost.
401                          */
402                         int (*writepage)(struct page *);
403
404                         writepage = page->mapping->a_ops->writepage;
405                         if ((gfp_mask & __GFP_FS) && writepage) {
406                                 ClearPageDirty(page);
407                                 SetPageLaunder(page);
408                                 page_cache_get(page);
409                                 spin_unlock(&pagemap_lru_lock);
410
411                                 writepage(page);
412                                 page_cache_release(page);
413
414                                 spin_lock(&pagemap_lru_lock);
415                                 continue;
416                         }
417                 }
This handles the case where a page is dirty, is not mapped by any process, has no buffers and is backed by a file or device mapping. The page is cleaned and will be reclaimed by the previous block of code when the IO is complete.

393 PageDirty() checks the PG_dirty bit and is_page_cache_freeable() will return true if it is not mapped by any process and has no buffers

404 Get a pointer to the necessary writepage() function for this mapping or device

405-416 This block of code can only be executed if a writepage() function is available and the GFP flags allow file operations

406-407 Clear the dirty bit and mark that the page is being laundered

408 Take a reference to the page so it will not be freed unexpectedly

409 Unlock the LRU list

411 Call the filesystem-specific writepage() function which is taken from the address_space_operations belonging to page→mapping

412 Release the reference to the page

414-415 Re-acquire the LRU list lock and move to the next page
424                 if (page->buffers) {
425                         spin_unlock(&pagemap_lru_lock);
426
427                         /* avoid to free a locked page */
428                         page_cache_get(page);
429
430                         if (try_to_release_page(page, gfp_mask)) {
431                                 if (!page->mapping) {
438                                         spin_lock(&pagemap_lru_lock);
439                                         UnlockPage(page);
440                                         __lru_cache_del(page);
441
442                                         /* effectively free the page here */
443                                         page_cache_release(page);
444
445                                         if (--nr_pages)
446                                                 continue;
447                                         break;
448                                 } else {
454                                         page_cache_release(page);
455
456                                         spin_lock(&pagemap_lru_lock);
457                                 }
458                         } else {
459                                 /* failed to drop the buffers so stop here */
460                                 UnlockPage(page);
461                                 page_cache_release(page);
462
463                                 spin_lock(&pagemap_lru_lock);
464                                 continue;
465                         }
466                 }
The page has buffers associated with it that must be freed.

425 Release the LRU lock as we may sleep

428 Take a reference to the page

430 Call try_to_release_page() which will attempt to release the buffers associated with the page. It returns 1 if it succeeds

431-447 This is the case where an anonymous page that was in the swap cache has now had its buffers cleared and removed. As it was in the swap cache, it was placed on the LRU by add_to_swap_cache(), so remove it now from the LRU and drop the reference to the page. In swap_writepage(), remove_exclusive_swap_page() is called which will delete the page from the swap cache when there are no more processes mapping the page. This block will free the page after the buffers have been written out if it was backed by a swap file

438-440 Take the LRU list lock, unlock the page, delete it from the page cache and free it

445-446 Update nr_pages to show a page has been freed and move to the next page

447 If nr_pages drops to 0, then exit the loop as the work is complete

449-456 If the page does have an associated mapping, then simply drop the reference to the page and re-acquire the LRU lock. More work will be performed later to remove the page from the page cache at line 499

459-464 If the buffers could not be freed, then unlock the page, drop the reference to it, re-acquire the LRU lock and move to the next page
468                 spin_lock(&pagecache_lock);
469
470                 /*
471                  * this is the non-racy check for busy page.
472                  */
473                 if (!page->mapping || !is_page_cache_freeable(page)) {
474                         spin_unlock(&pagecache_lock);
475                         UnlockPage(page);
476 page_mapped:
477                         if (--max_mapped >= 0)
478                                 continue;
479
484                         spin_unlock(&pagemap_lru_lock);
485                         swap_out(priority, gfp_mask, classzone);
486                         return nr_pages;
487                 }
468 From this point on, pages in the swap cache are likely to be examined, which is protected by the pagecache_lock, so it must now be held

473-487 An anonymous page with no buffers is mapped by a process

474-475 Release the page cache lock and the page

477-478 Decrement max_mapped. If it has not reached 0, move to the next page

484-485 Too many mapped pages have been found in the page cache. The LRU lock is released and swap_out() is called to begin swapping out whole processes

493                 if (PageDirty(page)) {
494                         spin_unlock(&pagecache_lock);
495                         UnlockPage(page);
496                         continue;
497                 }

493-497 The page has no references but could have been dirtied by the last process to free it if the dirty bit was set in the PTE. It is left in the page cache and will get laundered later. Once it has been cleaned, it can be safely deleted
498                 /* point of no return */
499                 if (likely(!PageSwapCache(page))) {
500                         __remove_inode_page(page);
501                         spin_unlock(&pagecache_lock);
502                 } else {
503                         swp_entry_t swap;
504                         swap.val = page->index;
505                         __delete_from_swap_cache(page);
506                         spin_unlock(&pagecache_lock);
507                         swap_free(swap);
508                 }
509
510
511                 __lru_cache_del(page);
512                 UnlockPage(page);
513
514                 /* effectively free the page here */
515                 page_cache_release(page);
516
517                 if (--nr_pages)
518                         continue;
519                 break;
520         }

500-503 If the page does not belong to the swap cache, it is part of the inode queue so it is removed

504-508 Remove it from the swap cache as there are no more references to it

511 Delete it from the LRU lists with __lru_cache_del()

512 Unlock the page

515 Free the page

517-518 Decrement nr_pages and move to the next page if it is not 0

519 If it reaches 0, the work of the function is complete
521         spin_unlock(&pagemap_lru_lock);
522         return nr_pages;
523
524 }

521-524 Function exit. Free the LRU lock and return the number of pages left to free
J.5 Shrinking all caches

Contents
J.5 Shrinking all caches                                  550
  J.5.1 Function: shrink_caches()                         550
  J.5.2 Function: try_to_free_pages()                     551
  J.5.3 Function: try_to_free_pages_zone()                552

J.5.1 Function: shrink_caches() (mm/vmscan.c)
The call graph for this function is shown in Figure 10.4.
560 static int shrink_caches(zone_t * classzone, int priority,
                        unsigned int gfp_mask, int nr_pages)
561 {
562         int chunk_size = nr_pages;
563         unsigned long ratio;
564
565         nr_pages -= kmem_cache_reap(gfp_mask);
566         if (nr_pages <= 0)
567                 return 0;
568
569         nr_pages = chunk_size;
570         /* try to keep the active list 2/3 of the size of the cache */
571         ratio = (unsigned long) nr_pages *
                        nr_active_pages / ((nr_inactive_pages + 1) * 2);
572         refill_inactive(ratio);
573
574         nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
575         if (nr_pages <= 0)
576                 return 0;
577
578         shrink_dcache_memory(priority, gfp_mask);
579         shrink_icache_memory(priority, gfp_mask);
580 #ifdef CONFIG_QUOTA
581         shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
582 #endif
583
584         return nr_pages;
585 }
560 The parameters are as follows:

classzone is the zone that pages should be freed from

priority determines how much work will be done to free pages

gfp_mask determines what sort of actions may be taken

nr_pages is the number of pages remaining to be freed

565-567 Ask the slab allocator to free up some pages with kmem_cache_reap() (See Section H.1.5.1). If enough are freed, the function returns, otherwise nr_pages will be freed from other caches

571-572 Move pages from the active_list to the inactive_list by calling refill_inactive() (See Section J.3.1). The number of pages moved depends on how many pages need to be freed and on keeping the active_list about two thirds the size of the page cache

574-575 Shrink the page cache; if enough pages are freed, return

578-582 Shrink the dcache, icache and dqcache. These are small objects in themselves but the cascading effect frees up a lot of disk buffers

584 Return the number of pages remaining to be freed
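To see how the ratio expression on line 571 nudges the active list towards two thirds of the page cache, the short calculation below evaluates the value passed to refill_inactive() for an invented balanced split and for an oversized active list. When the active pages are exactly twice the inactive pages, the expression reduces to roughly nr_pages; a disproportionately large active list inflates the result so more pages are moved across.

```c
#include <stdio.h>

/* Evaluate the refill ratio used by shrink_caches() for a given list split */
static unsigned long refill_ratio(unsigned long nr_pages,
                                  unsigned long nr_active,
                                  unsigned long nr_inactive)
{
	return nr_pages * nr_active / ((nr_inactive + 1) * 2);
}

int main(void)
{
	/* Invented sizes. With active exactly twice inactive (two thirds of
	 * the page cache), roughly nr_pages entries are moved. */
	printf("balanced:  %lu\n", refill_ratio(32, 8000, 4000));   /* ~31 */

	/* An oversized active list grows the ratio so refill_inactive()
	 * moves many more pages across. */
	printf("oversized: %lu\n", refill_ratio(32, 11000, 1000));  /* ~175 */
	return 0;
}
```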
J.5.2 Function: try_to_free_pages() (mm/vmscan.c)
This function cycles through all pgdats and tries to balance the preferred allocation zone (usually ZONE_NORMAL) for each of them. It is only called from one place, buffer.c:free_more_memory(), when the buffer manager fails to create new buffers or grow existing ones. It calls try_to_free_pages() with GFP_NOIO as the gfp_mask.
This results in the first zone in pg_data_t→node_zonelists having pages freed so that buffers can grow. This array is the preferred order of zones to allocate from and usually will begin with ZONE_NORMAL, which is required by the buffer manager. On NUMA architectures, some nodes may have ZONE_DMA as the preferred zone if the memory bank is dedicated to IO devices, and UML also uses only this zone. As the buffer manager is restricted in the zones it uses, there is no point balancing other zones.
607 int try_to_free_pages(unsigned int gfp_mask)
608 {
609         pg_data_t *pgdat;
610         zonelist_t *zonelist;
611         unsigned long pf_free_pages;
612         int error = 0;
613
614         pf_free_pages = current->flags & PF_FREE_PAGES;
615         current->flags &= ~PF_FREE_PAGES;
616
617         for_each_pgdat(pgdat) {
618                 zonelist = pgdat->node_zonelists +
                                (gfp_mask & GFP_ZONEMASK);
619                 error |= try_to_free_pages_zone(
620                                 zonelist->zones[0], gfp_mask);
621         }
622
623         current->flags |= pf_free_pages;
624         return error;
}
614-615 This clears the PF_FREE_PAGES flag if it is set so that pages freed by the process will be returned to the global pool rather than reserved for the process itself

617-620 Cycle through all nodes and call try_to_free_pages_zone() for the preferred zone in each node

618 This function is only called with GFP_NOIO as a parameter. When ANDed with GFP_ZONEMASK, it will always result in 0

622-623 Restore the process flags and return the result
J.5.3 Function: try_to_free_pages_zone() (mm/vmscan.c)
Try to free SWAP_CLUSTER_MAX pages from the requested zone. As well as being used by kswapd, this function is the entry point for the buddy allocator's direct-reclaim path.
587 int try_to_free_pages_zone(zone_t *classzone,
                        unsigned int gfp_mask)
588 {
589         int priority = DEF_PRIORITY;
590         int nr_pages = SWAP_CLUSTER_MAX;
591
592         gfp_mask = pf_gfp_mask(gfp_mask);
593         do {
594                 nr_pages = shrink_caches(classzone, priority,
                                gfp_mask, nr_pages);
595                 if (nr_pages <= 0)
596                         return 1;
597         } while (--priority);
598
599         /*
600          * Hmm.. Cache shrink failed - time to kill something?
601          * Mhwahahhaha! This is the part I really like. Giggle.
602          */
603         out_of_memory();
604         return 0;
605 }
589 Start with the lowest priority, statically defined to be 6

590 Try and free SWAP_CLUSTER_MAX pages, statically defined to be 32

592 pf_gfp_mask() checks the PF_NOIO flag in the current process flags. If no IO can be performed, it ensures there are no incompatible flags in the GFP mask

593-597 Starting with the lowest priority and increasing the priority with each pass, call shrink_caches() until nr_pages has been freed

595-596 If enough pages were freed, return indicating that the work is complete

603 If enough pages could not be freed even at highest priority (where at worst the full inactive_list is scanned), then check to see if we are out of memory. If we are, a process will be selected to be killed

604 Return indicating that we failed to free enough pages
J.6 Swapping Out Process Pages

Contents
J.6 Swapping Out Process Pages                            554
  J.6.1 Function: swap_out()                              554
  J.6.2 Function: swap_out_mm()                           556
  J.6.3 Function: swap_out_vma()                          557
  J.6.4 Function: swap_out_pgd()                          558
  J.6.5 Function: swap_out_pmd()                          559
  J.6.6 Function: try_to_swap_out()                       561

This section covers the path where too many process-mapped pages have been found in the LRU lists. This path will start scanning whole processes and reclaiming the mapped pages.
J.6.1 Function: swap_out() (mm/vmscan.c)
The call graph for this function is shown in Figure 10.5. This function linearly searches through every process' page tables trying to swap out SWAP_CLUSTER_MAX pages. The process it starts with is swap_mm and the starting address is mm→swap_address.
296 static int swap_out(unsigned int priority, unsigned int gfp_mask,
                        zone_t * classzone)
297 {
298         int counter, nr_pages = SWAP_CLUSTER_MAX;
299         struct mm_struct *mm;
300
301         counter = mmlist_nr;
302         do {
303                 if (unlikely(current->need_resched)) {
304                         __set_current_state(TASK_RUNNING);
305                         schedule();
306                 }
307
308                 spin_lock(&mmlist_lock);
309                 mm = swap_mm;
310                 while (mm->swap_address == TASK_SIZE || mm == &init_mm) {
311                         mm->swap_address = 0;
312                         mm = list_entry(mm->mmlist.next,
                                        struct mm_struct, mmlist);
313                         if (mm == swap_mm)
314                                 goto empty;
315                         swap_mm = mm;
316                 }
317
318                 /* Make sure the mm doesn't disappear
                       when we drop the lock.. */
319                 atomic_inc(&mm->mm_users);
320                 spin_unlock(&mmlist_lock);
321
322                 nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
323
324                 mmput(mm);
325
326                 if (!nr_pages)
327                         return 1;
328         } while (--counter >= 0);
329
330         return 0;
331
332 empty:
333         spin_unlock(&mmlist_lock);
334         return 0;
335 }
301 Set the counter so the process list is only scanned once

303-306 Reschedule if the quanta has been used up to prevent CPU hogging

308 Acquire the lock protecting the mm list

309 Start with the swap_mm. It is interesting that this is never checked to make sure it is valid. It is possible, albeit unlikely, that the process with the mm has exited since the last scan and the slab holding the mm_struct has been reclaimed during a cache shrink, making the pointer totally invalid. The lack of bug reports might be because the slab rarely gets reclaimed and it would be difficult to trigger in reality

310-316 Move to the next process if the swap_address has reached the TASK_SIZE or if the mm is the init_mm

311 Start at the beginning of the process space

312 Get the mm for this process

313-314 If it is the same, there are no running processes that can be examined

315 Record the swap_mm for the next pass

319 Increase the reference count so that the mm does not get freed while we are scanning

320 Release the mm lock

322 Begin scanning the mm with swap_out_mm() (See Section J.6.2)

324 Drop the reference to the mm

326-327 If the required number of pages has been freed, return success

328 If enough pages were not freed on this pass, loop again so that more processes will be scanned, until counter is exhausted

330 Return failure
J.6.2 Function: swap_out_mm() (mm/vmscan.c)
Walk through each VMA and call swap_out_vma() for each one.
256 static inline int swap_out_mm(struct mm_struct * mm, int count,
                        int * mmcounter, zone_t * classzone)
257 {
258         unsigned long address;
259         struct vm_area_struct* vma;
260
265         spin_lock(&mm->page_table_lock);
266         address = mm->swap_address;
267         if (address == TASK_SIZE || swap_mm != mm) {
268                 /* We raced: don't count this mm but try again */
269                 ++*mmcounter;
270                 goto out_unlock;
271         }
272         vma = find_vma(mm, address);
273         if (vma) {
274                 if (address < vma->vm_start)
275                         address = vma->vm_start;
276
277                 for (;;) {
278                         count = swap_out_vma(mm, vma, address,
                                        count, classzone);
279                         vma = vma->vm_next;
280                         if (!vma)
281                                 break;
282                         if (!count)
283                                 goto out_unlock;
284                         address = vma->vm_start;
285                 }
286         }
287         /* Indicate that we reached the end of address space */
288         mm->swap_address = TASK_SIZE;
289
290 out_unlock:
291         spin_unlock(&mm->page_table_lock);
292         return count;
293 }
265 Acquire the page table lock for this mm

266 Start with the address contained in swap_address

267-271 If the address is TASK_SIZE, it means that a thread raced and scanned this process already. Increase mmcounter so that the caller, swap_out(), knows to go to another process

272 Find the VMA for this address

273 Presuming a VMA was found, then...

274-275 Start at the beginning of the VMA

277-285 Scan through this and each subsequent VMA calling swap_out_vma() (See Section J.6.3) for each one. If the requisite number of pages (count) is freed, then finish scanning and return

288 Once the last VMA has been scanned, set swap_address to TASK_SIZE so that this process will be skipped over by swap_out_mm() next time
J.6.3 Function: swap_out_vma() (mm/vmscan.c)
Walk through this VMA and for each PGD in it, call swap_out_pgd().

227 static inline int swap_out_vma(struct mm_struct * mm,
                        struct vm_area_struct * vma,
                        unsigned long address, int count,
                        zone_t * classzone)
228 {
229         pgd_t *pgdir;
230         unsigned long end;
231
232         /* Don't swap out areas which are reserved */
233         if (vma->vm_flags & VM_RESERVED)
234                 return count;
235
236         pgdir = pgd_offset(mm, address);
237
238         end = vma->vm_end;
239         BUG_ON(address >= end);
240         do {
241                 count = swap_out_pgd(mm, vma, pgdir,
                                address, end, count, classzone);
242                 if (!count)
243                         break;
244                 address = (address + PGDIR_SIZE) & PGDIR_MASK;
245                 pgdir++;
246         } while (address && (address < end));
247         return count;
248 }
233-234 Skip over this VMA if the VM_RESERVED flag is set. This is used by some device drivers such as the SCSI generic driver

236 Get the starting PGD for the address

238-239 Mark where the end is and BUG() if the starting address is somehow past the end

240 Cycle through PGDs until the end address is reached

241 Call swap_out_pgd() (See Section J.6.4) keeping count of how many more pages need to be freed

242-243 If enough pages have been freed, break and return

244-245 Move to the next PGD and move the address to the next PGD-aligned address

247 Return the remaining number of pages to be freed
J.6.4 Function: swap_out_pgd() (mm/vmscan.c)
Step through all PMDs in the supplied PGD and call swap_out_pmd().

197 static inline int swap_out_pgd(struct mm_struct * mm,
                        struct vm_area_struct * vma, pgd_t *dir,
                        unsigned long address, unsigned long end,
                        int count, zone_t * classzone)
198 {
199         pmd_t * pmd;
200         unsigned long pgd_end;
201
202         if (pgd_none(*dir))
203                 return count;
204         if (pgd_bad(*dir)) {
205                 pgd_ERROR(*dir);
206                 pgd_clear(dir);
207                 return count;
208         }
209
210         pmd = pmd_offset(dir, address);
211
212         pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK;
213         if (pgd_end && (end > pgd_end))
214                 end = pgd_end;
215
216         do {
217                 count = swap_out_pmd(mm, vma, pmd,
                                address, end, count, classzone);
218                 if (!count)
219                         break;
220                 address = (address + PMD_SIZE) & PMD_MASK;
221                 pmd++;
222         } while (address && (address < end));
223         return count;
224 }
202-203 If there is no PGD, return

204-208 If the PGD is bad, flag it as such and return

210 Get the starting PMD

212-214 Calculate the end to be the end of this PGD or the end of the VMA being scanned, whichever is closer

216-222 For each PMD in this PGD, call swap_out_pmd() (See Section J.6.5). If enough pages get freed, break and return

223 Return the number of pages remaining to be freed
J.6.5 Function: swap_out_pmd() (mm/vmscan.c)
For each PTE in this PMD, call try_to_swap_out(). On completion, mm→swap_address is updated to show where we finished, to prevent the same page being examined again soon after this scan.
158 static inline int swap_out_pmd(struct mm_struct * mm,
                        struct vm_area_struct * vma, pmd_t *dir,
                        unsigned long address, unsigned long end,
                        int count, zone_t * classzone)
159 {
160         pte_t * pte;
161         unsigned long pmd_end;
162
163         if (pmd_none(*dir))
164                 return count;
165         if (pmd_bad(*dir)) {
166                 pmd_ERROR(*dir);
167                 pmd_clear(dir);
168                 return count;
169         }
170
171         pte = pte_offset(dir, address);
172
173         pmd_end = (address + PMD_SIZE) & PMD_MASK;
174         if (end > pmd_end)
175                 end = pmd_end;
176
177         do {
178                 if (pte_present(*pte)) {
179                         struct page *page = pte_page(*pte);
180
181                         if (VALID_PAGE(page) && !PageReserved(page)) {
182                                 count -= try_to_swap_out(mm, vma,
                                                address, pte,
                                                page, classzone);
183                                 if (!count) {
184                                         address += PAGE_SIZE;
185                                         break;
186                                 }
187                         }
188                 }
189                 address += PAGE_SIZE;
190                 pte++;
191         } while (address && (address < end));
192         mm->swap_address = address;
193         return count;
194 }
163-164 Return if there is no PMD

165-169 If the PMD is bad, flag it as such and return

171 Get the starting PTE

173-175 Calculate the end to be the end of the PMD or the end of the VMA, whichever is closer

177-191 Cycle through each PTE

178 Make sure the PTE is marked present

179 Get the struct page for this PTE

181 If it is a valid page and it is not reserved, then...

182 Call try_to_swap_out()

183-186 If enough pages have been swapped out, move the address to the next page and break to return

189-190 Move to the next page and PTE

192 Update the swap_address to show where we last finished

193 Return the number of pages remaining to be freed
J.6.6 Function: try_to_swap_out() (mm/vmscan.c)
This function tries to swap out a page from a process. It is quite a large function so it will be dealt with in parts. Broadly speaking, they are:

• Function preamble; ensure this is a page that should be swapped out

• Remove the page and PTE from the page tables

• Handle the case where the page is already in the swap cache

• Handle the case where the page is dirty or has associated buffers

• Handle the case where the page is being added to the swap cache
47 static inline int try_to_swap_out(struct mm_struct * mm,
                        struct vm_area_struct* vma,
                        unsigned long address,
                        pte_t * page_table,
                        struct page *page,
                        zone_t * classzone)
48 {
49         pte_t pte;
50         swp_entry_t entry;
51
52         /* Don't look at this pte if it's been accessed recently. */
53         if ((vma->vm_flags & VM_LOCKED) ||
                        ptep_test_and_clear_young(page_table)) {
54                 mark_page_accessed(page);
55                 return 0;
56         }
57
58         /* Don't bother unmapping pages that are active */
59         if (PageActive(page))
60                 return 0;
61
62         /* Don't bother replenishing zones not under pressure.. */
63         if (!memclass(page_zone(page), classzone))
64                 return 0;
65
66         if (TryLockPage(page))
67                 return 0;
53-56 If the VMA is locked in memory (VM_LOCKED) or the PTE shows the page has been accessed recently, clear the referenced bit and call mark_page_accessed() (See Section J.2.3.1) to make the struct page reflect the age. Return 0 to show it was not swapped out

59-60 If the page is on the active_list, do not swap it out

63-64 If the page belongs to a zone we are not interested in, do not swap it out

66-67 If the page is already locked for IO, skip it
74         flush_cache_page(vma, address);
75         pte = ptep_get_and_clear(page_table);
76         flush_tlb_page(vma, address);
77
78         if (pte_dirty(pte))
79                 set_page_dirty(page);
80

74 Call the architecture hook to flush this page from all CPUs

75 Get the PTE from the page tables and clear it

76 Call the architecture hook to flush the TLB

78-79 If the PTE was marked dirty, mark the struct page dirty so it will be laundered correctly
86         if (PageSwapCache(page)) {
87                 entry.val = page->index;
88                 swap_duplicate(entry);
89 set_swap_pte:
90                 set_pte(page_table, swp_entry_to_pte(entry));
91 drop_pte:
92                 mm->rss--;
93                 UnlockPage(page);
94                 {
95                         int freeable =
                                page_count(page) - !!page->buffers <= 2;
96                         page_cache_release(page);
97                         return freeable;
98                 }
99         }
Handle the case where the page is already in the swap cache.

86 Enter this block only if the page is already in the swap cache. Note that it can also be entered by calling goto to the set_swap_pte and drop_pte labels

87-88 Fill in the index value for the swap entry. swap_duplicate() verifies the swap identifier is valid and increases the counter in the swap_map if it is

90 Fill the PTE with the information needed to get the page from swap

92 Update RSS to show there is one less page being mapped by the process

93 Unlock the page

95 The page is free-able if the count is currently 2 or less and it has no buffers. If the count is higher, it is either being mapped by other processes or is a file-backed page and the user is the page cache

96 Decrement the reference count and free the page if it reaches 0. Note that if this is a file-backed page, it will not reach 0 even if no processes are mapping it. The page will be reclaimed from the page cache later by shrink_cache() (See Section J.4.1)

97 Return whether the page was freed or not
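The line-95 test can be read as "the only remaining users are the swap cache and the mapping being torn down, plus possibly the buffers". The tiny snippet below simply re-evaluates that expression for a few invented reference counts; freeable() and the example values are made up for illustration.

```c
#include <stdio.h>
#include <stdbool.h>

/* Re-evaluate the line-95 expression for a given count and buffer state */
static bool freeable(int page_count, bool has_buffers)
{
	return page_count - (has_buffers ? 1 : 0) <= 2;
}

int main(void)
{
	/* Swap cache reference + the mapping being torn down = 2: freeable */
	printf("count 2, no buffers: %d\n", freeable(2, false));
	/* The same two references plus attached buffers = 3: still freeable */
	printf("count 3, buffers:    %d\n", freeable(3, true));
	/* A third real reference means another process still maps the page */
	printf("count 3, no buffers: %d\n", freeable(3, false));
	return 0;
}
```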
115         if (page->mapping)
116                 goto drop_pte;
117         if (!PageDirty(page))
118                 goto drop_pte;

124         if (page->buffers)
125                 goto preserve;

115-116 If the page has an associated mapping, simply drop it from the page tables. When no processes are mapping it, it will be reclaimed from the page cache by shrink_cache()

117-118 If the page is clean, it is safe to simply drop it

124-125 If it has associated buffers due to a truncate followed by a page fault, then re-attach the page and PTE to the page tables as it cannot be handled yet
126         /*
127          * This is a dirty, swappable page. First of all,
128          * get a suitable swap entry for it, and make sure
129          * we have the swap cache set up to associate the
130          * page with that swap entry.
131          */
132
133         for (;;) {
134                 entry = get_swap_page();
135                 if (!entry.val)
136                         break;
137                 /* Add it to the swap cache and mark it dirty
138                  * (adding to the page cache will clear the dirty
139                  * and uptodate bits, so we need to do it again)
140                  */
141                 if (add_to_swap_cache(page, entry) == 0) {
142                         SetPageUptodate(page);
143                         set_page_dirty(page);
144                         goto set_swap_pte;
145                 }
146                 /* Raced with "speculative" read_swap_cache_async */
147                 swap_free(entry);
148         }
149
150         /* No swap space left */
151 preserve:
152         set_pte(page_table, pte);
153         UnlockPage(page);
154         return 0;
155 }
134 Allocate a swap entry for this page

135-136 If one could not be allocated, break out to where the PTE and page will be re-attached to the process page tables

141 Add the page to the swap cache

142 Mark the page as up to date in memory

143 Mark the page dirty so that it will be written out to swap soon

144 Goto set_swap_pte which will update the PTE with the information needed to get the page from swap later

147 If the add to the swap cache failed, it means that the page was already placed in the swap cache by a readahead, so drop the work done here

152 Reattach the PTE to the page tables

153 Unlock the page

154 Return that no page was freed
J.7 Page Swap Daemon

Contents
J.7 Page Swap Daemon                                      565
  J.7.1 Initialising kswapd                               565
    J.7.1.1 Function: kswapd_init()                       565
  J.7.2 kswapd Daemon                                     565
    J.7.2.1 Function: kswapd()                            565
    J.7.2.2 Function: kswapd_can_sleep()                  567
    J.7.2.3 Function: kswapd_can_sleep_pgdat()            567
    J.7.2.4 Function: kswapd_balance()                    568
    J.7.2.5 Function: kswapd_balance_pgdat()              568

This section details the main loops used by the kswapd daemon which is woken up when memory is low. The main functions covered are the ones that determine if kswapd can sleep and how it determines which nodes need balancing.
J.7.1 Initialising kswapd

J.7.1.1 Function: kswapd_init() (mm/vmscan.c)
Start the kswapd kernel thread.

767 static int __init kswapd_init(void)
768 {
769         printk("Starting kswapd\n");
770         swap_setup();
771         kernel_thread(kswapd, NULL, CLONE_FS
                                      | CLONE_FILES
                                      | CLONE_SIGNAL);
772         return 0;
773 }

770 swap_setup() (See Section K.4.2) sets up how many pages will be prefetched when reading from backing storage, based on the amount of physical memory

771 Start the kswapd kernel thread
J.7.2 kswapd Daemon

J.7.2.1 Function: kswapd() (mm/vmscan.c)
The main function of the kswapd kernel thread.

720 int kswapd(void *unused)
721 {
722         struct task_struct *tsk = current;
723         DECLARE_WAITQUEUE(wait, tsk);
724
725         daemonize();
726         strcpy(tsk->comm, "kswapd");
727         sigfillset(&tsk->blocked);
728
741         tsk->flags |= PF_MEMALLOC;
742
746         for (;;) {
747                 __set_current_state(TASK_INTERRUPTIBLE);
748                 add_wait_queue(&kswapd_wait, &wait);
749
750                 mb();
751                 if (kswapd_can_sleep())
752                         schedule();
753
754                 __set_current_state(TASK_RUNNING);
755                 remove_wait_queue(&kswapd_wait, &wait);
756
762                 kswapd_balance();
763                 run_task_queue(&tq_disk);
764         }
765 }
725 Call daemonize() which will make this a kernel thread, remove the mm context, close all files and re-parent the process

726 Set the name of the process

727 Ignore all signals

741 By setting this flag, the physical page allocator will always try to satisfy requests for pages. As this process will always be trying to free pages, it is worth satisfying its requests

746-764 Loop endlessly

747-748 This adds kswapd to the wait queue in preparation to sleep

750 The memory barrier (mb()) ensures that all reads and writes that occurred before this line will be visible to all CPUs

751 kswapd_can_sleep() (See Section J.7.2.2) cycles through all nodes and zones checking the need_balance field. If any of them is set to 1, kswapd cannot sleep

752 By calling schedule(), kswapd will now sleep until woken again by the physical page allocator in __alloc_pages() (See Section F.1.3)

754-755 Once woken up, kswapd is removed from the wait queue as it is now running

762 kswapd_balance() (See Section J.7.2.4) cycles through all zones and calls try_to_free_pages_zone() (See Section J.5.3) for each zone that requires balance

763 Run the IO task queue to start writing data out to disk
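The prepare-to-sleep, check, sleep, wake, work cycle above is easier to see without the kernel scaffolding. The sketch below is a hypothetical user-space analogue built on a pthread condition variable; every name in it, including balance(), is invented for illustration. Note that the condition variable hides the explicit mb() that the raw wait-queue protocol in kswapd() needs in order to order the flag check against being placed on the queue.

```c
#include <pthread.h>
#include <stdbool.h>

/* User-space analogue of the kswapd sleep/wake protocol (invented names) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kswapd_wait = PTHREAD_COND_INITIALIZER;
static bool need_balance;            /* set by the "allocator" when memory is low */

/* Stand-in for the reclaim work; kswapd_balance() would go here */
static void balance(void)
{
	/* reclaim work omitted in this sketch */
}

/* The daemon loop: sleep only while no zone needs balancing, then work */
void *kswapd_thread(void *unused)
{
	(void)unused;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!need_balance)             /* analogue of kswapd_can_sleep() */
			pthread_cond_wait(&kswapd_wait, &lock);
		need_balance = false;
		pthread_mutex_unlock(&lock);

		balance();                        /* analogue of kswapd_balance() */
	}
	return NULL;
}

/* Called by an "allocator" that finds free memory below its watermark */
void wake_kswapd(void)
{
	pthread_mutex_lock(&lock);
	need_balance = true;
	pthread_mutex_unlock(&lock);
	pthread_cond_signal(&kswapd_wait);
}
```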
J.7.2.2 Function: kswapd_can_sleep() (mm/vmscan.c)
Simple function to cycle through all pgdats and call kswapd_can_sleep_pgdat() on each.

695 static int kswapd_can_sleep(void)
696 {
697         pg_data_t * pgdat;
698
699         for_each_pgdat(pgdat) {
700                 if (!kswapd_can_sleep_pgdat(pgdat))
701                         return 0;
702         }
703
704         return 1;
705 }

699-702 for_each_pgdat() does exactly as the name implies. It cycles through all available pgdats and in this case calls kswapd_can_sleep_pgdat() (See Section J.7.2.3) for each. On the x86, there will only be one pgdat
J.7.2.3 Function: kswapd_can_sleep_pgdat() (mm/vmscan.c)
Cycles through all zones to make sure none of them need balance. The zone→need_balance flag is set by __alloc_pages() when the number of free pages in the zone reaches the pages_low watermark.

680 static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
681 {
682         zone_t * zone;
683         int i;
684
685         for (i = pgdat->nr_zones-1; i >= 0; i--) {
686                 zone = pgdat->node_zones + i;
687                 if (!zone->need_balance)
688                         continue;
689                 return 0;
690         }
691
692         return 1;
693 }

685-689 Simple for loop to cycle through all zones

686 The node_zones field is an array of all available zones so adding i gives the index

687-688 If the zone does not need balance, continue

689 0 is returned if any zone needs balance, indicating kswapd cannot sleep

692 Return indicating kswapd can sleep if the for loop completes
J.7.2.4 Function: kswapd_balance() (mm/vmscan.c)
Continuously cycle through each pgdat until none require balancing.

667 static void kswapd_balance(void)
668 {
669         int need_more_balance;
670         pg_data_t * pgdat;
671
672         do {
673                 need_more_balance = 0;
674
675                 for_each_pgdat(pgdat)
676                         need_more_balance |= kswapd_balance_pgdat(pgdat);
677         } while (need_more_balance);
678 }

672-677 Cycle through all pgdats until none of them report that they need balancing

675 For each pgdat, call kswapd_balance_pgdat() to check if the node requires balancing. If any node requires balancing, need_more_balance will be set to 1
J.7.2.5 Function: kswapd_balance_pgdat() (mm/vmscan.c)
This function will check if a node requires balance by examining each of the zones in it. If any zone requires balancing, try_to_free_pages_zone() will be called.

641 static int kswapd_balance_pgdat(pg_data_t * pgdat)
642 {
643         int need_more_balance = 0, i;
644         zone_t * zone;
645
646         for (i = pgdat->nr_zones-1; i >= 0; i--) {
647                 zone = pgdat->node_zones + i;
648                 if (unlikely(current->need_resched))
649                         schedule();
650                 if (!zone->need_balance)
651                         continue;
652                 if (!try_to_free_pages_zone(zone, GFP_KSWAPD)) {
653                         zone->need_balance = 0;
654                         __set_current_state(TASK_INTERRUPTIBLE);
655                         schedule_timeout(HZ);
656                         continue;
657                 }
658                 if (check_classzone_need_balance(zone))
659                         need_more_balance = 1;
660                 else
661                         zone->need_balance = 0;
662         }
663
664         return need_more_balance;
665 }

646-662 Cycle through each zone and call try_to_free_pages_zone() (See Section J.5.3) if it needs re-balancing

647 node_zones is an array and i is an index within it

648-649 Call schedule() if the quanta is expired to prevent kswapd hogging the CPU

650-651 If the zone does not require balance, move to the next one

652-657 If the function returns 0, it means the out_of_memory() function was called because a sufficient number of pages could not be freed. kswapd sleeps for 1 second to give the system a chance to reclaim the killed processes' pages and perform IO. The zone is marked as balanced so kswapd will ignore this zone until the allocator function __alloc_pages() complains again

658-661 If it was successful, check_classzone_need_balance() is called to see if the zone requires further balancing or not

664 Return 1 if any zone requires further balancing
Appendix K

Swap Management

Contents
K.1 Scanning for Free Entries . . . . . . . . . . . . . . . . . . . . . . 572
  K.1.1 Function: get_swap_page() . . . . . . . . . . . . . . . . . . . . 572
  K.1.2 Function: scan_swap_map() . . . . . . . . . . . . . . . . . . . . 574
K.2 Swap Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
  K.2.1 Adding Pages to the Swap Cache . . . . . . . . . . . . . . . . . 577
    K.2.1.1 Function: add_to_swap_cache() . . . . . . . . . . . . . . . . 577
    K.2.1.2 Function: swap_duplicate() . . . . . . . . . . . . . . . . . 578
  K.2.2 Deleting Pages from the Swap Cache . . . . . . . . . . . . . . . 580
    K.2.2.1 Function: swap_free() . . . . . . . . . . . . . . . . . . . . 580
    K.2.2.2 Function: swap_entry_free() . . . . . . . . . . . . . . . . . 580
  K.2.3 Acquiring/Releasing Swap Cache Pages . . . . . . . . . . . . . . 581
    K.2.3.1 Function: swap_info_get() . . . . . . . . . . . . . . . . . . 581
    K.2.3.2 Function: swap_info_put() . . . . . . . . . . . . . . . . . . 582
  K.2.4 Searching the Swap Cache . . . . . . . . . . . . . . . . . . . . 583
    K.2.4.1 Function: lookup_swap_cache() . . . . . . . . . . . . . . . . 583
K.3 Swap Area IO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
  K.3.1 Reading Backing Storage . . . . . . . . . . . . . . . . . . . . . 584
    K.3.1.1 Function: read_swap_cache_async() . . . . . . . . . . . . . . 584
  K.3.2 Writing Backing Storage . . . . . . . . . . . . . . . . . . . . . 586
    K.3.2.1 Function: swap_writepage() . . . . . . . . . . . . . . . . . 586
    K.3.2.2 Function: remove_exclusive_swap_page() . . . . . . . . . . . 586
    K.3.2.3 Function: free_swap_and_cache() . . . . . . . . . . . . . . . 588
  K.3.3 Block IO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
    K.3.3.1 Function: rw_swap_page() . . . . . . . . . . . . . . . . . . 589
    K.3.3.2 Function: rw_swap_page_base() . . . . . . . . . . . . . . . . 590
    K.3.3.3 Function: get_swaphandle_info() . . . . . . . . . . . . . . . 592
K.4 Activating a Swap Area . . . . . . . . . . . . . . . . . . . . . . . 594
  K.4.1 Function: sys_swapon() . . . . . . . . . . . . . . . . . . . . . 594
  K.4.2 Function: swap_setup() . . . . . . . . . . . . . . . . . . . . . 605
K.5 Deactivating a Swap Area . . . . . . . . . . . . . . . . . . . . . . 606
  K.5.1 Function: sys_swapoff() . . . . . . . . . . . . . . . . . . . . . 606
  K.5.2 Function: try_to_unuse() . . . . . . . . . . . . . . . . . . . . 610
  K.5.3 Function: unuse_process() . . . . . . . . . . . . . . . . . . . . 615
  K.5.4 Function: unuse_vma() . . . . . . . . . . . . . . . . . . . . . . 616
  K.5.5 Function: unuse_pgd() . . . . . . . . . . . . . . . . . . . . . . 616
  K.5.6 Function: unuse_pmd() . . . . . . . . . . . . . . . . . . . . . . 618
  K.5.7 Function: unuse_pte() . . . . . . . . . . . . . . . . . . . . . . 619
K.1 Scanning for Free Entries

Contents
K.1 Scanning for Free Entries                                             572
    K.1.1 Function: get_swap_page()                                       572
    K.1.2 Function: scan_swap_map()                                       574

K.1.1 Function: get_swap_page() (mm/swapfile.c)

The call graph for this function is shown in Figure 11.2. This is the high level
API function for searching the swap areas for a free swap slot and returning the
resulting swp_entry_t.
 99 swp_entry_t get_swap_page(void)
100 {
101     struct swap_info_struct * p;
102     unsigned long offset;
103     swp_entry_t entry;
104     int type, wrapped = 0;
105
106     entry.val = 0;   /* Out of memory */
107     swap_list_lock();
108     type = swap_list.next;
109     if (type < 0)
110         goto out;
111     if (nr_swap_pages <= 0)
112         goto out;
113
114     while (1) {
115         p = &swap_info[type];
116         if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
117             swap_device_lock(p);
118             offset = scan_swap_map(p);
119             swap_device_unlock(p);
120             if (offset) {
121                 entry = SWP_ENTRY(type,offset);
122                 type = swap_info[type].next;
123                 if (type < 0 ||
124                     p->prio != swap_info[type].prio) {
125                     swap_list.next = swap_list.head;
126                 } else {
127                     swap_list.next = type;
128                 }
129                 goto out;
130             }
131         }
132         type = p->next;
133         if (!wrapped) {
134             if (type < 0 || p->prio != swap_info[type].prio) {
135                 type = swap_list.head;
136                 wrapped = 1;
137             }
138         } else
139             if (type < 0)
140                 goto out;   /* out of swap space */
141     }
142 out:
143     swap_list_unlock();
144     return entry;
145 }
107 Lock the list of swap areas
108 Get the next swap area that is to be used for allocating from. This list will
be ordered depending on the priority of the swap areas
109-110 If there are no swap areas, return NULL
111-112 If the accounting says there are no available swap slots, return NULL
114-141 Cycle through all swap areas
115 Get the current swap info struct from the swap_info array
116 If this swap area is available for writing to and is active...
117 Lock the swap area
118 Call scan_swap_map() (See Section K.1.2) which searches the requested swap
map for a free slot
119 Unlock the swap device
120-130 If a slot was free...
121 Encode an identifier for the entry with SWP_ENTRY()
122 Record the next swap area to use
123-126 If the next area is the end of the list or the priority of the next swap area
does not match the current one, move back to the head
126-128 Otherwise move to the next area
129 Goto out
132 Move to the next swap area
133-138 Check for wraparound. Set wrapped to 1 if we get to the end of the list of
swap areas
139-140 If there were no available swap areas, goto out
142 The exit to this function
143 Unlock the swap area list
144 Return the entry if one was found and NULL otherwise
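As an aside to line 121, the sketch below re-implements the swp_entry_t packing
in ordinary user-space C so the type/offset encoding can be experimented with.
The shift values and the 6-bit type mask are assumptions loosely modelled on the
x86 definitions; the real macros are architecture specific and live in the per-arch
<asm/pgtable.h> headers, so treat this purely as an illustration, not as the kernel's
implementation.

    /* swp_entry_demo.c - hypothetical, simplified re-implementation of the
     * swp_entry_t encoding for illustration only.  The shifts below are
     * assumptions modelled on x86; real kernels define them per architecture.
     */
    #include <stdio.h>

    typedef struct { unsigned long val; } swp_entry_t;

    #define TYPE_SHIFT   1   /* assumed: bit 0 reserved for the present bit */
    #define OFFSET_SHIFT 8   /* assumed: offset stored above the type field */

    static swp_entry_t mk_entry(unsigned long type, unsigned long offset)
    {
            swp_entry_t e;
            e.val = (type << TYPE_SHIFT) | (offset << OFFSET_SHIFT);
            return e;
    }

    static unsigned long entry_type(swp_entry_t e)   { return (e.val >> TYPE_SHIFT) & 0x3f; }
    static unsigned long entry_offset(swp_entry_t e) { return e.val >> OFFSET_SHIFT; }

    int main(void)
    {
            /* Pretend get_swap_page() picked slot 42 in swap area 1 */
            swp_entry_t entry = mk_entry(1, 42);

            printf("entry.val = %#lx, type = %lu, offset = %lu\n",
                   entry.val, entry_type(entry), entry_offset(entry));
            return 0;
    }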
K.1.2 Function: scan_swap_map() (mm/swapfile.c)

This function tries to allocate SWAPFILE_CLUSTER number of pages sequentially in
swap. When it has allocated that many, it searches for another block of free slots of
size SWAPFILE_CLUSTER. If it fails to find one, it resorts to allocating the first free
slot. This clustering attempts to make sure that slots are allocated and freed in
SWAPFILE_CLUSTER sized chunks.
36 static inline int scan_swap_map(struct swap_info_struct *si)
37 {
38      unsigned long offset;
47      if (si->cluster_nr) {
48          while (si->cluster_next <= si->highest_bit) {
49              offset = si->cluster_next++;
50              if (si->swap_map[offset])
51                  continue;
52              si->cluster_nr--;
53              goto got_page;
54          }
55      }
Allocate SWAPFILE_CLUSTER pages sequentially. cluster_nr is initialised to
SWAPFILE_CLUSTER and decrements with each allocation

47 If cluster_nr is still positive, allocate the next available sequential slot
48 While the current offset to use (cluster_next) is less than the highest known
free slot (highest_bit) then ...
49 Record the offset and update cluster_next to the next free slot
50-51 If the slot is not actually free, move to the next one
52 Slot has been found, decrement the cluster_nr field
53 Goto the out path
56      si->cluster_nr = SWAPFILE_CLUSTER;
57
58      /* try to find an empty (even not aligned) cluster. */
59      offset = si->lowest_bit;
60 check_next_cluster:
61      if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
62      {
63          int nr;
64          for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
65              if (si->swap_map[nr])
66              {
67                  offset = nr+1;
68                  goto check_next_cluster;
69              }
70          /* We found a completly empty cluster, so start
71           * using it.
72           */
73          goto got_page;
74      }

At this stage, SWAPFILE_CLUSTER pages have been allocated sequentially so find
the next free block of SWAPFILE_CLUSTER pages.
56 Re-initialise the count of sequential pages to allocate to SWAPFILE_CLUSTER
59 Start searching at the lowest known free slot
61 If the offset plus the cluster size is less than the known last free slot, then
examine all the pages to see if this is a large free block
64 Scan from offset to offset + SWAPFILE_CLUSTER
65-69 If this slot is used, then start searching again for a free slot beginning after
this known allocated one
73 A large cluster was found so use it
75      /* No luck, so now go finegrined as usual. -Andrea */
76      for (offset = si->lowest_bit; offset <= si->highest_bit ; offset++) {
77          if (si->swap_map[offset])
78              continue;
79          si->lowest_bit = offset+1;

This unusual for loop extract starts scanning for a free page starting from
lowest_bit

77-78 If the slot is in use, move to the next one
79 Update the lowest_bit known probable free slot to the succeeding one

80 got_page:
81          if (offset == si->lowest_bit)
82              si->lowest_bit++;
83          if (offset == si->highest_bit)
84              si->highest_bit--;
85          if (si->lowest_bit > si->highest_bit) {
86              si->lowest_bit = si->max;
87              si->highest_bit = 0;
88          }
89          si->swap_map[offset] = 1;
90          nr_swap_pages--;
91          si->cluster_next = offset+1;
92          return offset;
93      }
94      si->lowest_bit = si->max;
95      si->highest_bit = 0;
96      return 0;
97 }
A slot has been found, do some housekeeping and return it

81-82 If this offset is the known lowest free slot (lowest_bit), increment it
83-84 If this offset is the highest known likely free slot, decrement it
85-88 If the low and high mark meet, the swap area is not worth searching any
more because these marks represent the lowest and highest known free slots.
Set the low slot to be the highest possible slot and the high mark to 0 to cut
down on search time later. This will be fixed up the next time a slot is freed
89 Set the reference count for the slot
90 Update the accounting for the number of available swap pages (nr_swap_pages)
91 Set cluster_next to the adjacent slot so the next search will start here
92 Return the free slot
94-96 No free slot available, mark the area unsearchable and return 0
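The clustering policy described above boils down to three phases: hand out slots
sequentially inside the current cluster, then look for a completely free cluster, then
fall back to first-fit. The stand-alone sketch below reproduces that policy over a
small array; the names and sizes are invented for illustration and are not the
kernel's.

    /* cluster_scan_demo.c - user-space sketch of the scan_swap_map()
     * allocation policy: sequential within a cluster, then a fully-free
     * cluster, then first-fit.
     */
    #include <stdio.h>

    #define NSLOTS  32
    #define CLUSTER 4

    static unsigned char map[NSLOTS];   /* 0 == free, non-zero == in use */
    static int cluster_left = 0;        /* slots left in the current cluster */
    static int cluster_next = 0;        /* next slot to try sequentially */

    static int scan_map(void)
    {
            int offset, nr;

            /* Phase 1: keep allocating sequentially inside the current cluster */
            if (cluster_left) {
                    while (cluster_next < NSLOTS) {
                            offset = cluster_next++;
                            if (map[offset])
                                    continue;
                            cluster_left--;
                            goto got_slot;
                    }
            }
            cluster_left = CLUSTER;

            /* Phase 2: look for a completely empty cluster */
            for (offset = 0; offset + CLUSTER - 1 < NSLOTS; offset++) {
                    for (nr = offset; nr < offset + CLUSTER; nr++)
                            if (map[nr])
                                    break;
                    if (nr == offset + CLUSTER)
                            goto got_slot;
            }

            /* Phase 3: first-fit fallback */
            for (offset = 0; offset < NSLOTS; offset++)
                    if (!map[offset])
                            goto got_slot;
            return -1;                   /* no free slot */

    got_slot:
            map[offset] = 1;
            cluster_next = offset + 1;
            return offset;
    }

    int main(void)
    {
            for (int i = 0; i < 6; i++)
                    printf("allocated slot %d\n", scan_map());
            return 0;
    }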
K.2 Swap Cache

Contents
K.2 Swap Cache                                                            577
    K.2.1 Adding Pages to the Swap Cache                                  577
        K.2.1.1 Function: add_to_swap_cache()                             577
        K.2.1.2 Function: swap_duplicate()                                578
    K.2.2 Deleting Pages from the Swap Cache                              580
        K.2.2.1 Function: swap_free()                                     580
        K.2.2.2 Function: swap_entry_free()                               580
    K.2.3 Acquiring/Releasing Swap Cache Pages                            581
        K.2.3.1 Function: swap_info_get()                                 581
        K.2.3.2 Function: swap_info_put()                                 582
    K.2.4 Searching the Swap Cache                                        583
        K.2.4.1 Function: lookup_swap_cache()                             583

K.2.1 Adding Pages to the Swap Cache

K.2.1.1 Function: add_to_swap_cache() (mm/swap_state.c)

The call graph for this function is shown in Figure 11.3. This function wraps
around the normal page cache handler. It first checks if the page is already in the
swap cache with swap_duplicate() and if it does not, it calls
add_to_page_cache_unique() instead.
70 int add_to_swap_cache(struct page *page, swp_entry_t entry)
71 {
72      if (page->mapping)
73          BUG();
74      if (!swap_duplicate(entry)) {
75          INC_CACHE_INFO(noent_race);
76          return -ENOENT;
77      }
78      if (add_to_page_cache_unique(page, &swapper_space, entry.val,
79              page_hash(&swapper_space, entry.val)) != 0) {
80          swap_free(entry);
81          INC_CACHE_INFO(exist_race);
82          return -EEXIST;
83      }
84      if (!PageLocked(page))
85          BUG();
86      if (!PageSwapCache(page))
87          BUG();
88      INC_CACHE_INFO(add_total);
89      return 0;
90 }
72-73 A check is made with PageSwapCache() before this function is called to
make sure the page is not already in the swap cache. This check here ensures
the page has no other existing mapping in case the caller was careless and did
not make the check
74-77 Use swap_duplicate() (See Section K.2.1.2) to try and increment the count
for this entry. If a slot already exists in the swap_map, increment the statistic
recording the number of races involving adding pages to the swap cache and
return -ENOENT
78 Try and add the page to the page cache with add_to_page_cache_unique()
(See Section J.1.1.2). This function is similar to add_to_page_cache()
(See Section J.1.1.1) except it searches the page cache for a duplicate entry
with __find_page_nolock(). The managing address space is swapper_space.
The offset within the file in this case is the offset within swap_map, hence
entry.val, and finally the page is hashed based on address_space and offset
within swap_map
80-83 If it already existed in the page cache, we raced so increment the statistic
recording the number of races to insert an existing page into the swap cache
and return EEXIST
84-85 If the page is locked for IO, it is a bug
86-87 If it is not now in the swap cache, something went seriously wrong
88 Increment the statistic recording the total number of pages in the swap cache
89 Return success
K.2.1.2 Function: swap_duplicate() (mm/swapfile.c)

This function verifies a swap entry is valid and, if so, increments its swap map
count.
1161 int swap_duplicate(swp_entry_t entry)
1162 {
1163    struct swap_info_struct * p;
1164    unsigned long offset, type;
1165    int result = 0;
1166
1167    type = SWP_TYPE(entry);
1168    if (type >= nr_swapfiles)
1169        goto bad_file;
1170    p = type + swap_info;
1171    offset = SWP_OFFSET(entry);
1172
1173    swap_device_lock(p);
1174    if (offset < p->max && p->swap_map[offset]) {
1175        if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
1176            p->swap_map[offset]++;
1177            result = 1;
1178        } else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
1179            if (swap_overflow++ < 5)
1180                printk(KERN_WARNING "swap_dup: swap entry overflow\n");
1181            p->swap_map[offset] = SWAP_MAP_MAX;
1182            result = 1;
1183        }
1184    }
1185    swap_device_unlock(p);
1186 out:
1187    return result;
1188
1189 bad_file:
1190    printk(KERN_ERR "swap_dup: %s%08lx\n", Bad_file, entry.val);
1191    goto out;
1192 }
1161 The parameter is the swap entry to increase the swap_map count for
1167-1169 Get the offset within the swap_info for the swap_info_struct
containing this entry. If it is greater than the number of swap areas, goto
bad_file
1170-1171 Get the relevant swap_info_struct and get the offset within its
swap_map
1173 Lock the swap device
1174 Make a quick sanity check to ensure the offset is within the swap_map and
that the slot indicated has a positive count. A 0 count would mean the slot is
not free and this is a bogus swp_entry_t
1175-1177 If the count is not SWAP_MAP_MAX, simply increment it and return 1 for
success
1178-1183 Else the count would overflow so set it to SWAP_MAP_MAX and reserve
the slot permanently. In reality this condition is virtually impossible
1185-1187 Unlock the swap device and return
1190-1191 If a bad device was used, print out the error message and return failure
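The swap_map count therefore behaves like a saturating reference count: it normally
increments, but once it would reach the ceiling the slot is pinned forever. The
user-space sketch below mirrors only that policy; COUNT_MAX is an assumed
stand-in for SWAP_MAP_MAX, not the kernel's definition.

    /* satcount_demo.c - minimal sketch of the saturating count policy used
     * for swap_map[] entries.  Illustration only.
     */
    #include <stdio.h>

    #define COUNT_MAX 0x7fff        /* assumed ceiling, like SWAP_MAP_MAX */

    static int count_dup(unsigned short *count)
    {
            if (*count == 0)
                    return 0;               /* bogus entry: the slot is free */
            if (*count < COUNT_MAX - 1) {
                    (*count)++;             /* normal case */
                    return 1;
            }
            *count = COUNT_MAX;             /* would overflow: pin the slot */
            return 1;
    }

    int main(void)
    {
            unsigned short count = 1;       /* slot owned by the swap cache */

            printf("dup ok=%d, count=%u\n", count_dup(&count), count);
            count = COUNT_MAX - 1;
            printf("dup ok=%d, count=%u (saturated)\n", count_dup(&count), count);
            return 0;
    }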
K.2.2 Deleting Pages from the Swap Cache

K.2.2.1 Function: swap_free() (mm/swapfile.c)

Decrements the corresponding swap_map entry for the swp_entry_t
214 void swap_free(swp_entry_t entry)
215 {
216     struct swap_info_struct * p;
217
218     p = swap_info_get(entry);
219     if (p) {
220         swap_entry_free(p, SWP_OFFSET(entry));
221         swap_info_put(p);
222     }
223 }
218 swap_info_get() (See Section K.2.3.1) fetches the correct swap_info_struct
and performs a number of debugging checks to ensure it is a valid area and a
valid swap_map entry. If all is sane, it will lock the swap device
219-222 If it is valid, the corresponding swap_map entry is decremented with
swap_entry_free() (See Section K.2.2.2) and swap_info_put() (See Section
K.2.3.2) called to free the device

K.2.2.2 Function: swap_entry_free() (mm/swapfile.c)
192 static int swap_entry_free(struct swap_info_struct *p,
                              unsigned long offset)
193 {
194     int count = p->swap_map[offset];
195
196     if (count < SWAP_MAP_MAX) {
197         count--;
198         p->swap_map[offset] = count;
199         if (!count) {
200             if (offset < p->lowest_bit)
201                 p->lowest_bit = offset;
202             if (offset > p->highest_bit)
203                 p->highest_bit = offset;
204             nr_swap_pages++;
205         }
206     }
207     return count;
208 }

194 Get the current count
196 If the count indicates the slot is not permanently reserved then...
197-198 Decrement the count and store it in the swap_map
199 If the count reaches 0, the slot is free so update some information
200-201 If this freed slot is below lowest_bit, update lowest_bit which indicates
the lowest known free slot
202-203 Similarly, update the highest_bit if this newly freed slot is above it
204 Increment the count indicating the number of free swap slots
207 Return the current count
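As a counterpart to the allocation-side clustering sketch earlier, the fragment below
illustrates how freeing a slot widens the lowest_bit/highest_bit search window,
which is all swap_entry_free() does besides the decrement. The array and window
variables are illustrative stand-ins, not kernel structures.

    /* free_hint_demo.c - sketch of widening the free-slot search window when
     * a slot's count drops to zero.  Illustration only.
     */
    #include <stdio.h>

    #define NSLOTS 16

    static unsigned short map[NSLOTS];
    static unsigned long lowest_bit  = NSLOTS;  /* window start (hint only) */
    static unsigned long highest_bit = 0;       /* window end (hint only)   */

    static void slot_free(unsigned long offset)
    {
            if (map[offset] == 0 || --map[offset])
                    return;                     /* still referenced */
            /* Slot became free: widen the window so future scans see it */
            if (offset < lowest_bit)
                    lowest_bit = offset;
            if (offset > highest_bit)
                    highest_bit = offset;
    }

    int main(void)
    {
            map[3] = 1;
            map[9] = 2;
            slot_free(3);
            slot_free(9);   /* count drops to 1, window unchanged */
            printf("window: [%lu, %lu]\n", lowest_bit, highest_bit);
            return 0;
    }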
K.2.3 Acquiring/Releasing Swap Cache Pages

K.2.3.1 Function: swap_info_get() (mm/swapfile.c)

This function finds the swap_info_struct for the given entry, performs some
basic checking and then locks the device.
147 static struct swap_info_struct * swap_info_get(swp_entry_t entry)
148 {
149     struct swap_info_struct * p;
150     unsigned long offset, type;
151
152     if (!entry.val)
153         goto out;
154     type = SWP_TYPE(entry);
155     if (type >= nr_swapfiles)
156         goto bad_nofile;
157     p = & swap_info[type];
158     if (!(p->flags & SWP_USED))
159         goto bad_device;
160     offset = SWP_OFFSET(entry);
161     if (offset >= p->max)
162         goto bad_offset;
163     if (!p->swap_map[offset])
164         goto bad_free;
165     swap_list_lock();
166     if (p->prio > swap_info[swap_list.next].prio)
167         swap_list.next = type;
168     swap_device_lock(p);
169     return p;
170
171 bad_free:
172     printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
173     goto out;
174 bad_offset:
175     printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
176     goto out;
177 bad_device:
178     printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, entry.val);
179     goto out;
180 bad_nofile:
181     printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, entry.val);
182 out:
183     return NULL;
184 }
152-153 If the supplied entry is NULL, return
154 Get the offset within the swap_info array
155-156 Ensure it is a valid area
157 Get the address of the area
158-159 If the area is not active yet, print a bad device error and return
160 Get the offset within the swap_map
161-162 Make sure the offset is not after the end of the map
163-164 Make sure the slot is currently in use
165 Lock the swap area list
166-167 If this area is of higher priority than the area that would be next, ensure
the current area is used
168-169 Lock the swap device and return the swap area descriptor
K.2.3.2 Function: swap_info_put() (mm/swapfile.c)

This function simply unlocks the area and list
186 static void swap_info_put(struct swap_info_struct * p)
187 {
188     swap_device_unlock(p);
189     swap_list_unlock();
190 }
188 Unlock the device
189 Unlock the swap area list
K.2.4 Searching the Swap Cache

K.2.4.1 Function: lookup_swap_cache() (mm/swap_state.c)

Top level function for finding a page in the swap cache
161 struct page * lookup_swap_cache(swp_entry_t entry)
162 {
163     struct page *found;
164
165     found = find_get_page(&swapper_space, entry.val);
166     /*
167      * Unsafe to assert PageSwapCache and mapping on page found:
168      * if SMP nothing prevents swapoff from deleting this page from
169      * the swap cache at this moment. find_lock_page would prevent
170      * that, but no need to change: we _have_ got the right page.
171      */
172     INC_CACHE_INFO(find_total);
173     if (found)
174         INC_CACHE_INFO(find_success);
175     return found;
176 }
165 find_get_page() (See Section J.1.4.1) is the principal function for returning
the struct page. It uses the normal page hashing and cache functions for
quickly finding it
172 Increase the statistic recording the number of times a page was searched for
in the cache
173-174 If one was found, increment the successful find count
175 Return the struct page or NULL if it did not exist
K.3 Swap Area IO

Contents
K.3 Swap Area IO                                                          584
    K.3.1 Reading Backing Storage                                         584
        K.3.1.1 Function: read_swap_cache_async()                         584
    K.3.2 Writing Backing Storage                                         586
        K.3.2.1 Function: swap_writepage()                                586
        K.3.2.2 Function: remove_exclusive_swap_page()                    586
        K.3.2.3 Function: free_swap_and_cache()                           588
    K.3.3 Block IO                                                        589
        K.3.3.1 Function: rw_swap_page()                                  589
        K.3.3.2 Function: rw_swap_page_base()                             590
        K.3.3.3 Function: get_swaphandle_info()                           592

K.3.1 Reading Backing Storage

K.3.1.1 Function: read_swap_cache_async() (mm/swap_state.c)

This function will return the requested page from the swap cache if it exists. If it
does not, a page will be allocated, placed in the swap cache and the data is
scheduled to be read from disk with rw_swap_page().
184 struct page * read_swap_cache_async(swp_entry_t entry)
185 {
186     struct page *found_page, *new_page = NULL;
187     int err;
188
189     do {
196         found_page = find_get_page(&swapper_space, entry.val);
197         if (found_page)
198             break;
199
200         /*
201          * Get a new page to read into from swap.
202          */
203         if (!new_page) {
204             new_page = alloc_page(GFP_HIGHUSER);
205             if (!new_page)
206                 break;      /* Out of memory */
207         }
208
209         /*
210          * Associate the page with swap entry in the swap cache.
211          * May fail (-ENOENT) if swap entry has been freed since
212          * our caller observed it. May fail (-EEXIST) if there
213          * is already a page associated with this entry in the
214          * swap cache: added by a racing read_swap_cache_async,
215          * or by try_to_swap_out (or shmem_writepage) re-using
216          * the just freed swap entry for an existing page.
217          */
218         err = add_to_swap_cache(new_page, entry);
219         if (!err) {
220             /*
221              * Initiate read into locked page and return.
222              */
223             rw_swap_page(READ, new_page);
224             return new_page;
225         }
226     } while (err != -ENOENT);
227
228     if (new_page)
229         page_cache_release(new_page);
230     return found_page;
231 }
189 Loop in case add_to_swap_cache() fails to add a page to the swap cache
196 First search the swap cache with find_get_page() (See Section J.1.4.1) to
see if the page is already available. Ordinarily, lookup_swap_cache()
(See Section K.2.4.1) would be called but it updates statistics (such as the
number of cache searches) so find_get_page() (See Section J.1.4.1) is called
directly
203-207 If the page is not in the swap cache and we have not allocated one yet,
allocate one with alloc_page()
218 Add the newly allocated page to the swap cache with add_to_swap_cache()
(See Section K.2.1.1)
223 Schedule the data to be read with rw_swap_page() (See Section K.3.3.1). The
page will be returned locked and will be unlocked when IO completes
224 Return the new page
226 Loop until add_to_swap_cache() succeeds or another process successfully
inserts the page into the swap cache
228-229 This is either the error path or another process added the page to the swap
cache for us. If a new page was allocated, free it with page_cache_release()
(See Section J.1.3.2)
230 Return either the page found in the swap cache or an error
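The loop above is a general lookup/allocate/insert-retry pattern: retry on a transient
race (-EEXIST) and give up if the key itself disappeared (-ENOENT). The
self-contained sketch below reproduces only that control flow with toy stand-ins for
the cache and the IO start; none of the helper names are kernel functions.

    /* retry_pattern_demo.c - sketch of the race-handling control flow used by
     * read_swap_cache_async().  Illustration only.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NKEYS 8
    static void *cache[NKEYS];              /* toy cache: key -> object */
    static int   key_valid[NKEYS] = {1,1,1,1,1,1,1,1};

    static void *cache_lookup(unsigned long key) { return cache[key]; }

    static int cache_insert(void *obj, unsigned long key)
    {
            if (!key_valid[key])
                    return -ENOENT;         /* the key vanished under us */
            if (cache[key])
                    return -EEXIST;         /* lost a race with another inserter */
            cache[key] = obj;
            return 0;
    }

    static void start_read(void *obj) { (void)obj; /* pretend to start async IO */ }

    static void *read_object(unsigned long key)
    {
            void *found = NULL, *new_obj = NULL;
            int err;

            do {
                    found = cache_lookup(key);
                    if (found)
                            break;
                    if (!new_obj && !(new_obj = malloc(64)))
                            break;                  /* out of memory */
                    err = cache_insert(new_obj, key);
                    if (!err) {
                            start_read(new_obj);
                            return new_obj;         /* we own the read */
                    }
            } while (err != -ENOENT);

            free(new_obj);                          /* free(NULL) is a no-op */
            return found;
    }

    int main(void)
    {
            printf("object at %p\n", read_object(3));
            return 0;
    }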
K.3.2 Writing Backing Storage

K.3.2.1 Function: swap_writepage() (mm/swap_state.c)

This is the function registered in swap_aops for writing out pages. Its operation is
pretty simple. First it calls remove_exclusive_swap_page() to try and free the
page. If the page was freed, then the page will be unlocked here before returning as
there is no IO pending on the page. Otherwise rw_swap_page() is called to sync
the page with backing storage.
24 static int swap_writepage(struct page *page)
25 {
26      if (remove_exclusive_swap_page(page)) {
27          UnlockPage(page);
28          return 0;
29      }
30      rw_swap_page(WRITE, page);
31      return 0;
32 }
26-29 remove_exclusive_swap_page() (See Section K.3.2.2) will reclaim the page
from the swap cache if possible. If the page is reclaimed, unlock it before
returning
30 Otherwise the page is still in the swap cache so synchronise it with backing
storage by calling rw_swap_page() (See Section K.3.3.1)

K.3.2.2 Function: remove_exclusive_swap_page() (mm/swapfile.c)

This function tries to work out if there are other processes sharing this page or
not. If possible the page will be removed from the swap cache and freed. Once
removed from the swap cache, swap_free() is called to decrement the swap_map
count, indicating that the swap cache is no longer using the slot. The count will
instead reflect the number of PTEs that contain a swp_entry_t for this slot.
287 int remove_exclusive_swap_page(struct page *page)
288 {
289     int retval;
290     struct swap_info_struct * p;
291     swp_entry_t entry;
292
293     if (!PageLocked(page))
294         BUG();
295     if (!PageSwapCache(page))
296         return 0;
297     if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */
298         return 0;
299
300     entry.val = page->index;
301     p = swap_info_get(entry);
302     if (!p)
303         return 0;
304
305     /* Is the only swap cache user the cache itself? */
306     retval = 0;
307     if (p->swap_map[SWP_OFFSET(entry)] == 1) {
308         /* Recheck the page count with the pagecache lock held.. */
309         spin_lock(&pagecache_lock);
310         if (page_count(page) - !!page->buffers == 2) {
311             __delete_from_swap_cache(page);
312             SetPageDirty(page);
313             retval = 1;
314         }
315         spin_unlock(&pagecache_lock);
316     }
317     swap_info_put(p);
318
319     if (retval) {
320         block_flushpage(page, 0);
321         swap_free(entry);
322         page_cache_release(page);
323     }
324
325     return retval;
326 }
293-294 This operation should only be made with the page locked
295-296 If the page is not in the swap cache, then there is nothing to do
297-298 If there are other users of the page, then it cannot be reclaimed so return
300 The swp_entry_t for the page is stored in page→index as explained in
Section 2.4
301 Get the swap_info_struct with swap_info_get() (See Section K.2.3.1)
307 If the only user of the swap slot is the swap cache itself (i.e. no process is
mapping it), then delete this page from the swap cache to free the slot. Later
the swap slot usage count will be decremented as the swap cache is no longer
using it
310 If the current user is the only user of this page, then it is safe to remove from
the swap cache. If another process is sharing it, it must remain here
311 Delete from the swap cache
313 Set retval to 1 so that the caller knows the page was freed and so that
swap_free() (See Section K.2.2.1) will be called to decrement the usage count
in the swap_map
317 Drop the reference to the swap slot that was taken with swap_info_get()
(See Section K.2.3.1)
320 The slot is being freed so call block_flushpage() so that all IO will complete
and any buffers associated with the page will be freed
321 Free the swap slot with swap_free()
322 Drop the reference to the page
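The check at line 297, page_count(page) - !!page->buffers == 2, is worth
unpacking: the double negation collapses the buffers pointer to 0 or 1 so the
expression asks whether the only remaining references are the caller's and the swap
cache's. The toy structure below is invented purely to demonstrate the arithmetic;
it is not the kernel's struct page.

    /* refcheck_demo.c - illustrates the "2: us + cache" reference test. */
    #include <stdio.h>

    struct toy_page {
            int   count;            /* like page_count(page)           */
            void *buffers;          /* like page->buffers, may be NULL */
    };

    static int only_us_and_cache(const struct toy_page *page)
    {
            return page->count - !!page->buffers == 2;
    }

    int main(void)
    {
            struct toy_page a = { 2, NULL };   /* caller + swap cache           */
            struct toy_page b = { 3, &a };     /* caller + cache + buffers      */
            struct toy_page c = { 3, NULL };   /* a process maps the page too   */

            printf("%d %d %d\n", only_us_and_cache(&a),
                   only_us_and_cache(&b), only_us_and_cache(&c));   /* 1 1 0 */
            return 0;
    }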
K.3.2.3 Function: free_swap_and_cache() (mm/swapfile.c)

This function frees an entry from the swap cache and tries to reclaim the page.
Note that this function only applies to the swap cache.
332 void free_swap_and_cache(swp_entry_t entry)
333 {
334     struct swap_info_struct * p;
335     struct page *page = NULL;
336
337     p = swap_info_get(entry);
338     if (p) {
339         if (swap_entry_free(p, SWP_OFFSET(entry)) == 1)
340             page = find_trylock_page(&swapper_space, entry.val);
341         swap_info_put(p);
342     }
343     if (page) {
344         page_cache_get(page);
345         /* Only cache user (+us), or swap space full? Free it! */
346         if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
347             delete_from_swap_cache(page);
348             SetPageDirty(page);
349         }
350         UnlockPage(page);
351         page_cache_release(page);
352     }
353 }
337 Get the swap_info struct for the requested entry
338-342 Presuming the swap area information struct exists, call swap_entry_free()
to free the swap entry. The page for the entry is then located in the swap
cache using find_trylock_page(). Note that the page is returned locked
341 Drop the reference taken to the swap info struct at line 337
343-352 If the page was located then we try to reclaim it
344 Take a reference to the page so it will not be freed prematurely
346-349 The page is deleted from the swap cache if there are no processes
mapping the page or if the swap area is more than 50% full (checked by
vm_swap_full())
350 Unlock the page again
351 Drop the local reference to the page taken at line 344
K.3.3 Block IO

K.3.3.1 Function: rw_swap_page() (mm/page_io.c)

This is the main function used for reading data from backing storage into a page
or writing data from a page to backing storage. Which operation it performs
depends on the first parameter rw. It is basically a wrapper function around the
core function rw_swap_page_base(). This simply enforces that the operations are
only performed on pages in the swap cache.
85 void rw_swap_page(int rw, struct page *page)
86 {
87      swp_entry_t entry;
88
89      entry.val = page->index;
90
91      if (!PageLocked(page))
92          PAGE_BUG(page);
93      if (!PageSwapCache(page))
94          PAGE_BUG(page);
95      if (!rw_swap_page_base(rw, entry, page))
96          UnlockPage(page);
97 }
85 rw indicates whether a read or write is taking place
89 Get the swp_entry_t from the index field
91-92 If the page is not locked for IO, it is a bug
93-94 If the page is not in the swap cache, it is a bug
95 Call the core function rw_swap_page_base(). If it returns failure, the page is
unlocked with UnlockPage() so it can be freed
K.3.3.2 Function: rw_swap_page_base() (mm/page_io.c)

This is the core function for reading or writing data to the backing storage.
Whether it is writing to a partition or a file, the block layer brw_page() function is
used to perform the actual IO. This function sets up the necessary buffer information
for the block layer to do its job. brw_page() performs asynchronous IO so it is
likely it will return with the page locked which will be unlocked when the IO
completes.
36 static int rw_swap_page_base(int rw, swp_entry_t entry,
                               struct page *page)
37 {
38      unsigned long offset;
39      int zones[PAGE_SIZE/512];
40      int zones_used;
41      kdev_t dev = 0;
42      int block_size;
43      struct inode *swapf = 0;
44
45      if (rw == READ) {
46          ClearPageUptodate(page);
47          kstat.pswpin++;
48      } else
49          kstat.pswpout++;
50
36 The parameters are:
    rw indicates whether the operation is a read or a write
    entry is the swap entry for locating the data in backing storage
    page is the page that is being read or written to
39 zones is a parameter required by the block layer for brw_page(). It is expected
to contain an array of block numbers that are to be written to. This is primarily
of importance when the backing storage is a file rather than a partition
45-47 If the page is to be read from disk, clear the Uptodate flag as the page is
obviously not up to date if we are reading information from the disk. Increment
the pages swapped in (pswpin) statistic
49 Else just update the pages swapped out (pswpout) statistic
51      get_swaphandle_info(entry, &offset, &dev, &swapf);
52      if (dev) {
53          zones[0] = offset;
54          zones_used = 1;
55          block_size = PAGE_SIZE;
56      } else if (swapf) {
57          int i, j;
58          unsigned int block =
59              offset << (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits);
60
61          block_size = swapf->i_sb->s_blocksize;
62          for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size)
63              if (!(zones[i] = bmap(swapf,block++))) {
64                  printk("rw_swap_page: bad swap file\n");
65                  return 0;
66              }
67          zones_used = i;
68          dev = swapf->i_dev;
69      } else {
70          return 0;
71      }
72
73      /* block_size == PAGE_SIZE/zones_used */
74      brw_page(rw, page, dev, zones, block_size);
75      return 1;
76 }
51 get_swaphandle_info() (See Section K.3.3.3) returns either the kdev_t or
struct inode that represents the swap area, whichever is appropriate
52-55 If the storage area is a partition, then there is only one block to be written
which is the size of a page. Hence, zones only has one entry which is the offset
within the partition to be written and the block_size is PAGE_SIZE
56 Else it is a swap file so each of the blocks in the file that make up the page has
to be mapped with bmap() before calling brw_page()
58-59 Calculate what the starting block is
61 The size of an individual block is stored in the superblock information for the
filesystem the file resides on
62-66 Call bmap() for every block that makes up the full page. Each block is
stored in the zones array for passing to brw_page(). If any block fails to be
mapped, 0 is returned
67 Record how many blocks make up the page in zones_used
68 Record which device is being written to
74 Call brw_page() from the block layer to schedule the IO to occur. This function
returns immediately as the IO is asynchronous. When the IO is completed, a
callback function (end_buffer_io_async()) is called which unlocks the page.
Any process waiting on the page will be woken up at that point
75 Return success
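For file-backed swap, the number of bmap() calls per page is simply
PAGE_SIZE/block_size. The short user-space sketch below walks through that
mapping with an assumed 4KiB page and 1KiB filesystem blocks; fake_bmap() is a
made-up stand-in for the real bmap().

    /* bmap_demo.c - sketch of splitting a page-sized swap slot into
     * filesystem blocks, as rw_swap_page_base() does for swap files.
     */
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    static unsigned long fake_bmap(unsigned long block)
    {
            return 1000 + block;    /* pretend file block N lives at disk block 1000+N */
    }

    int main(void)
    {
            unsigned long offset = 5;        /* page-sized slot 5 within the swap file */
            unsigned long block_size = 1024; /* e.g. a filesystem with 1KiB blocks     */
            unsigned long block = offset << (PAGE_SHIFT - 10);  /* 10 == log2(1024)   */
            unsigned long zones[PAGE_SIZE / 512];
            unsigned long i, j;

            for (i = 0, j = 0; j < PAGE_SIZE; i++, j += block_size)
                    zones[i] = fake_bmap(block++);

            printf("page slot %lu needs %lu blocks:", offset, i);
            for (j = 0; j < i; j++)
                    printf(" %lu", zones[j]);
            printf("\n");
            return 0;
    }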
K.3.3.3 Function: get_swaphandle_info() (mm/swapfile.c)

This function is responsible for returning either the kdev_t or struct inode that
is managing the swap area that entry belongs to.
1197 void get_swaphandle_info(swp_entry_t entry, unsigned long *offset,
1198                          kdev_t *dev, struct inode **swapf)
1199 {
1200    unsigned long type;
1201    struct swap_info_struct *p;
1202
1203    type = SWP_TYPE(entry);
1204    if (type >= nr_swapfiles) {
1205        printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_file, entry.val);
1206        return;
1207    }
1208
1209    p = &swap_info[type];
1210    *offset = SWP_OFFSET(entry);
1211    if (*offset >= p->max && *offset != 0) {
1212        printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_offset, entry.val);
1213        return;
1214    }
1215    if (p->swap_map && !p->swap_map[*offset]) {
1216        printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_offset, entry.val);
1217        return;
1218    }
1219    if (!(p->flags & SWP_USED)) {
1220        printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_file, entry.val);
1221        return;
1222    }
1223
1224    if (p->swap_device) {
1225        *dev = p->swap_device;
1226    } else if (p->swap_file) {
1227        *swapf = p->swap_file->d_inode;
1228    } else {
1229        printk(KERN_ERR "rw_swap_page: no swap file or device\n");
1230    }
1231    return;
1232 }
1203 Extract which area within swap_info this entry belongs to
1204-1206 If the index is for an area that does not exist, then print out an
information message and return. Bad_file is a static array declared near the top
of mm/swapfile.c that says "Bad swap file entry"
1209 Get the swap_info_struct from swap_info
1210 Extract the offset within the swap area for this entry
1211-1214 Make sure the offset is not after the end of the file. Print out the
message in Bad_offset if it is
1215-1218 If the offset is currently not being used, it means that entry is a stale
entry so print out the error message in Unused_offset
1219-1222 If the swap area is currently not active, print out the error message in
Unused_file
1224 If the swap area is a device, return the kdev_t in swap_info_struct→swap_device
1226-1227 If it is a swap file, return the struct inode which is available via
swap_info_struct→swap_file→d_inode
1229 Else there is no swap file or device for this entry so print out the error
message and return
K.4 Activating a Swap Area

Contents
K.4 Activating a Swap Area                                                594
    K.4.1 Function: sys_swapon()                                          594
    K.4.2 Function: swap_setup()                                          605

K.4.1 Function: sys_swapon() (mm/swapfile.c)

This, quite large, function is responsible for the activating of swap space. Broadly
speaking, the tasks it takes are as follows (a sketch of how user space reaches this
system call is given after the list):

•   Find a free swap_info_struct in the swap_info array and initialise it with
    default values

•   Call user_path_walk() which traverses the directory tree for the supplied
    specialfile and populates a nameidata structure with the available data on
    the file, such as the dentry and the filesystem information for where it is
    stored (vfsmount)

•   Populate swap_info_struct fields pertaining to the dimensions of the swap
    area and how to find it. If the swap area is a partition, the block size will
    be configured to the PAGE_SIZE before calculating the size. If it is a file, the
    information is obtained directly from the inode

•   Ensure the area is not already activated. If not, allocate a page from memory
    and read the first page sized slot from the swap area. This page contains
    information such as the number of good slots and how to populate the
    swap_info_struct→swap_map with the bad entries

•   Allocate memory with vmalloc() for swap_info_struct→swap_map and
    initialise each entry with 0 for good slots and SWAP_MAP_BAD otherwise. Ideally
    the header information will be a version 2 file format as version 1 was limited
    to swap areas of just under 128MiB for architectures with 4KiB page sizes like
    the x86

•   After ensuring the information indicated in the header matches the actual
    swap area, fill in the remaining information in the swap_info_struct such
    as the maximum number of pages and the available good pages. Update the
    global statistics for nr_swap_pages and total_swap_pages

•   The swap area is now fully active and initialised and so it is inserted into the
    swap list in the correct position based on priority of the newly activated area
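For reference, the hypothetical program below shows how a privileged user-space
process reaches sys_swapon() and sys_swapoff() through the C library wrappers
in <sys/swap.h>. The device path is only a placeholder and error handling is kept
minimal.

    /* swapon_demo.c - minimal example of activating and deactivating a swap
     * area from user space.  Requires CAP_SYS_ADMIN; /dev/sda2 is a placeholder.
     */
    #include <stdio.h>
    #include <sys/swap.h>

    int main(void)
    {
            /* Ask for priority 5 using the same flag encoding sys_swapon() decodes */
            int flags = SWAP_FLAG_PREFER |
                        ((5 << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

            if (swapon("/dev/sda2", flags) != 0) {
                    perror("swapon");
                    return 1;
            }
            printf("swap area activated\n");

            if (swapoff("/dev/sda2") != 0) {
                    perror("swapoff");
                    return 1;
            }
            return 0;
    }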
855 asmlinkage long sys_swapon(const char * specialfile,
int swap_flags)
856 {
857
struct swap_info_struct * p;
K.4 Activating a Swap Area (sys_swapon())
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
595
struct nameidata nd;
struct inode * swap_inode;
unsigned int type;
int i, j, prev;
int error;
static int least_priority = 0;
union swap_header *swap_header = 0;
int swap_header_version;
int nr_good_pages = 0;
unsigned long maxpages = 1;
int swapfilesize;
struct block_device *bdev = NULL;
unsigned short *swap_map;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
lock_kernel();
swap_list_lock();
p = swap_info;
855 The two parameters are the path to the swap area and the flags for activation
872-873 The activating process must have the CAP_SYS_ADMIN capability or be
the superuser to activate a swap area
874 Acquire the Big Kernel Lock
875 Lock the list of swap areas
876 Get the first swap area in the swap_info array
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
for (type = 0 ; type < nr_swapfiles ; type++,p++)
if (!(p->flags & SWP_USED))
break;
error = -EPERM;
if (type >= MAX_SWAPFILES) {
swap_list_unlock();
goto out;
}
if (type >= nr_swapfiles)
nr_swapfiles = type+1;
p->flags = SWP_USED;
p->swap_file = NULL;
p->swap_vfsmnt = NULL;
p->swap_device = 0;
p->swap_map = NULL;
K.4 Activating a Swap Area (sys_swapon())
892
893
894
895
896
897
898
899
900
901
902
903
596
p->lowest_bit = 0;
p->highest_bit = 0;
p->cluster_nr = 0;
p->sdev_lock = SPIN_LOCK_UNLOCKED;
p->next = -1;
if (swap_flags & SWAP_FLAG_PREFER) {
p->prio =
(swap_flags & SWAP_FLAG_PRIO_MASK)>>SWAP_FLAG_PRIO_SHIFT;
} else {
p->prio = --least_priority;
}
swap_list_unlock();
Find a free swap_info_struct and initialise it with default values

877-879 Cycle through the swap_info until a struct is found that is not in use
880 By default the error returned is Permission Denied which indicates the caller
did not have the proper permissions or too many swap areas are already in use
881 If no struct was free, MAX_SWAPFILES areas have already been activated so
unlock the swap list and return
885-886 If the selected swap area is after the last known active area (nr_swapfiles),
then update nr_swapfiles
887 Set the flag indicating the area is in use
888-896 Initialise fields to default values
897-902 If the caller has specified a priority, use it, else set it to least_priority
and decrement it. This way, the swap areas will be prioritised in order of
activation
903 Release the swap list lock
904
905
906
907
908
909
910
911
912
error = user_path_walk(specialfile, &nd);
if (error)
goto bad_swap_2;
p->swap_file = nd.dentry;
p->swap_vfsmnt = nd.mnt;
swap_inode = nd.dentry->d_inode;
error = -EINVAL;
Traverse the VFS and get some information about the special file

904 user_path_walk() traverses the directory structure to obtain a nameidata
structure describing the specialfile
905-906 If it failed, return failure
908 Fill in the swap_file field with the returned dentry
909 Similarly, fill in the swap_vfsmnt
910 Record the inode of the special file
911 Now the default error is -EINVAL indicating that the special file was found but
it was not a block device or a regular file
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
if (S_ISBLK(swap_inode->i_mode)) {
kdev_t dev = swap_inode->i_rdev;
struct block_device_operations *bdops;
devfs_handle_t de;
p->swap_device = dev;
set_blocksize(dev, PAGE_SIZE);
bd_acquire(swap_inode);
bdev = swap_inode->i_bdev;
de = devfs_get_handle_from_inode(swap_inode);
bdops = devfs_get_ops(de);
if (bdops) bdev->bd_op = bdops;
error = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0,
BDEV_SWAP);
devfs_put_ops(de);/* Decrement module use count
* now we're safe*/
if (error)
goto bad_swap_2;
set_blocksize(dev, PAGE_SIZE);
error = -ENODEV;
if (!dev || (blk_size[MAJOR(dev)] &&
!blk_size[MAJOR(dev)][MINOR(dev)]))
goto bad_swap;
swapfilesize = 0;
if (blk_size[MAJOR(dev)])
swapfilesize = blk_size[MAJOR(dev)][MINOR(dev)]
>> (PAGE_SHIFT - 10);
} else if (S_ISREG(swap_inode->i_mode))
swapfilesize = swap_inode->i_size >> PAGE_SHIFT;
else
goto bad_swap;
If a partition, configure the block device before calculating the size of the area,
else obtain it from the inode for the file.

913 Check if the special file is a block device
914-939 This code segment handles the case where the swap area is a partition
914 Record a pointer to the device structure for the block device
918 Store a pointer to the device structure describing the special file which will be
needed for block IO operations
919 Set the block size on the device to be PAGE_SIZE as it will be page sized chunks
swap is interested in
921 The bd_acquire() function increments the usage count for this block device
922 Get a pointer to the block_device structure which is a descriptor for the
device file which is needed to open it
923 Get a devfs handle if it is enabled. devfs is beyond the scope of this book
924-925 Increment the usage count of this device entry
927 Open the block device in read/write mode and set the BDEV_SWAP flag which
is an enumerated type but is ignored when do_open() is called
928 Decrement the use count of the devfs entry
929-930 If an error occurred on open, return failure
931 Set the block size again
932 After this point, the default error is to indicate no device could be found
933-935 Ensure the returned device is ok
937-939 Calculate the size of the swap file as the number of page sized chunks
that exist in the block device as indicated by blk_size. The size of the swap
area is calculated to make sure the information in the swap area is sane
941 If the swap area is a regular file, obtain the size directly from the inode and
calculate how many page sized chunks exist
943 If the file is not a block device or regular file, return an error
945
946
947
948
949
error = -EBUSY;
for (i = 0 ; i < nr_swapfiles ; i++) {
struct swap_info_struct *q = &swap_info[i];
if (i == type || !q->swap_file)
continue;
K.4 Activating a Swap Area (sys_swapon())
950
951
952
953
954
955
956
957
958
959
960
961
962
}
if (swap_inode->i_mapping ==
q->swap_file->d_inode->i_mapping)
goto bad_swap;
swap_header = (void *) __get_free_page(GFP_USER);
if (!swap_header) {
printk("Unable to start swapping: out of memory :-)\n");
error = -ENOMEM;
goto bad_swap;
}
lock_page(virt_to_page(swap_header));
rw_swap_page_nolock(READ, SWP_ENTRY(type,0),
(char *) swap_header);
963
964
965
966
967
968
969
970
971
972
if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10))
swap_header_version = 1;
else if (!memcmp("SWAPSPACE2",swap_header->magic.magic,10))
swap_header_version = 2;
else {
printk("Unable to find swap-space signature\n");
error = -EINVAL;
goto bad_swap;
}
945 The next check makes sure the area is not already active. If it is, the error
-EBUSY will be returned
946-962 Read through the whole swap_info struct and ensure the area to be
activated is not already active
954-959 Allocate a page for reading the swap area information from disk
961 The function lock_page() locks a page and makes sure it is synced with disk
if it is file backed. In this case, it'll just mark the page as locked which is
required for the rw_swap_page_nolock() function
962 Read the first page slot in the swap area into swap_header
964-972 Check which version the swap area information is and set the
swap_header_version variable with it. If the swap area could not be identified,
return -EINVAL

974     switch (swap_header_version) {
975     case 1:
976         memset(((char *) swap_header)+PAGE_SIZE-10,0,10);
K.4 Activating a Swap Area (sys_swapon())
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
600
j = 0;
p->lowest_bit = 0;
p->highest_bit = 0;
for (i = 1 ; i < 8*PAGE_SIZE ; i++) {
if (test_bit(i,(char *) swap_header)) {
if (!p->lowest_bit)
p->lowest_bit = i;
p->highest_bit = i;
maxpages = i+1;
j++;
}
}
nr_good_pages = j;
p->swap_map = vmalloc(maxpages * sizeof(short));
if (!p->swap_map) {
error = -ENOMEM;
goto bad_swap;
}
for (i = 1 ; i < maxpages ; i++) {
if (test_bit(i,(char *) swap_header))
p->swap_map[i] = 0;
else
p->swap_map[i] = SWAP_MAP_BAD;
}
break;
Read in the information needed to populate the swap_map when the swap area is
version 1.

976 Zero out the magic string identifying the version of the swap area
978-979 Initialise fields in swap_info_struct to 0
980-988 A bitmap with 8*PAGE_SIZE entries is stored in the swap area. The full
page, minus 10 bits for the magic string, is used to describe the swap map,
limiting swap areas to just under 128MiB in size. If the bit is set to 1, there is
a slot on disk available. This pass will calculate how many slots are available
so a swap_map may be allocated
981 Test if the bit for this slot is set
982-983 If the lowest_bit field is not yet set, set it to this slot. In most cases,
lowest_bit will be initialised to 1
984 As long as new slots are found, keep updating the highest_bit
985 Count the number of pages
986 j is the count of good pages in the area
990 Allocate memory for the swap_map with vmalloc()
991-994 If memory could not be allocated, return ENOMEM
995-1000 For each slot, check if the slot is good. If yes, initialise the slot count
to 0, else set it to SWAP_MAP_BAD so it will not be used
1001 Exit the switch statement
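The just-under-128MiB limit quoted above for version 1 headers follows directly
from the bitmap layout: one header page of 8*PAGE_SIZE bits, less the space taken
by the magic string, each bit describing one page-sized slot. The back-of-the-
envelope calculation below works it out for 4KiB pages; it is a sketch, not kernel
code.

    /* v1_limit.c - work out the maximum size of a version 1 swap area for a
     * 4KiB page size.  The last 10 bytes of the header page hold the magic
     * string, so those bits cannot describe slots.
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned long page_size = 4096;
            unsigned long slots = 8 * page_size - 10 * 8;   /* usable bits in the header */
            unsigned long bytes = slots * page_size;

            printf("max slots: %lu, max size: %lu bytes (~%lu MiB)\n",
                   slots, bytes, bytes >> 20);
            return 0;
    }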
1003
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
case 2:
if (swap_header->info.version != 1) {
printk(KERN_WARNING
"Unable to handle swap header version %d\n",
swap_header->info.version);
error = -EINVAL;
goto bad_swap;
}
p->lowest_bit = 1;
maxpages = SWP_OFFSET(SWP_ENTRY(0,~0UL)) - 1;
if (maxpages > swap_header->info.last_page)
maxpages = swap_header->info.last_page;
p->highest_bit = maxpages - 1;
error = -EINVAL;
if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
goto bad_swap;
if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
error = -ENOMEM;
goto bad_swap;
}
error = 0;
memset(p->swap_map, 0, maxpages * sizeof(short));
for (i=0; i<swap_header->info.nr_badpages; i++) {
int page = swap_header->info.badpages[i];
if (page <= 0 ||
page >= swap_header->info.last_page)
error = -EINVAL;
else
p->swap_map[page] = SWAP_MAP_BAD;
}
K.4 Activating a Swap Area (sys_swapon())
1039
1040
1041
1042
1043
1044
602
nr_good_pages = swap_header->info.last_page -
        swap_header->info.nr_badpages -
        1 /* header page */;
if (error)
goto bad_swap;
}
Read the header information when the file format is version 2

1006-1012 Make absolutely sure we can handle this swap file format and return
-EINVAL if we cannot. Remember that with this version, the swap_header
struct is placed nicely on disk
1014 Initialise lowest_bit to the known lowest available slot
1015-1017 Calculate maxpages initially as the maximum possible size of a
swap_map and then set it to the size indicated by the information on disk.
This ensures the swap_map array is not accidentally overloaded
1018 Initialise highest_bit
1020-1022 Make sure the number of bad pages that exist does not exceed
MAX_SWAP_BADPAGES
1025-1028 Allocate memory for the swap_map with vmalloc()
1031 Initialise the full swap_map to 0 indicating all slots are available
1032-1038 Using the information loaded from disk, set each slot that is unusable
to SWAP_MAP_BAD
1039-1041 Calculate the number of available good pages
1042-1043 Return if an error occurred
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
if (swapfilesize && maxpages > swapfilesize) {
printk(KERN_WARNING
"Swap area shorter than signature indicates\n");
error = -EINVAL;
goto bad_swap;
}
if (!nr_good_pages) {
printk(KERN_WARNING "Empty swap-file\n");
error = -EINVAL;
goto bad_swap;
}
p->swap_map[0] = SWAP_MAP_BAD;
K.4 Activating a Swap Area (sys_swapon())
1058
1059
1060
1061
1062
1063
1064
1065
603
swap_list_lock();
swap_device_lock(p);
p->max = maxpages;
p->flags = SWP_WRITEOK;
p->pages = nr_good_pages;
nr_swap_pages += nr_good_pages;
total_swap_pages += nr_good_pages;
printk(KERN_INFO "Adding Swap:
%dk swap-space (priority %d)\n",
nr_good_pages<<(PAGE_SHIFT-10), p->prio);
1066
1046-1051 Ensure the information loaded from disk matches the actual dimensions
of the swap area. If they do not match, print a warning and return an error
1052-1056 If no good pages were available, return an error
1057 Make sure the first page in the map containing the swap header information
is not used. If it was, the header information would be overwritten the first
time this area was used
1058-1059 Lock the swap list and the swap device
1060-1062 Fill in the remaining fields in the swap_info_struct
1063-1064 Update global statistics for the number of available swap pages
(nr_swap_pages) and the total number of swap pages (total_swap_pages)
1065-1066 Print an informational message about the swap activation
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
/* insert swap space into swap_list: */
prev = -1;
for (i = swap_list.head; i >= 0; i = swap_info[i].next) {
if (p->prio >= swap_info[i].prio) {
break;
}
prev = i;
}
p->next = i;
if (prev < 0) {
swap_list.head = swap_list.next = p - swap_info;
} else {
swap_info[prev].next = p - swap_info;
}
swap_device_unlock(p);
swap_list_unlock();
error = 0;
goto out;
1070-1080 Insert the new swap area into the correct slot in the swap list based
on priority
1082 Unlock the swap device
1083 Unlock the swap list
1084-1085 Return success
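The insertion loop at lines 1070-1080 keeps the swap list sorted by descending
priority using array indices rather than pointers. The user-space sketch below
reproduces that idea with a plain next[] array; all the names are toy stand-ins for
swap_info and swap_list, not the kernel's structures.

    /* prio_list_demo.c - sketch of inserting into an index-linked list ordered
     * by descending priority, in the style of the swap_list insertion.
     */
    #include <stdio.h>

    #define NAREAS 4

    static int prio[NAREAS] = { 5, 1, 3, 2 };   /* priority of each area      */
    static int next[NAREAS];                    /* next area index, -1 == end */
    static int head = -1;

    static void insert_by_prio(int new)
    {
            int i, prev = -1;

            for (i = head; i >= 0; i = next[i]) {
                    if (prio[new] >= prio[i])
                            break;      /* insert before the first lower-priority area */
                    prev = i;
            }
            next[new] = i;
            if (prev < 0)
                    head = new;
            else
                    next[prev] = new;
    }

    int main(void)
    {
            for (int n = 0; n < NAREAS; n++)
                    insert_by_prio(n);
            for (int i = head; i >= 0; i = next[i])
                    printf("area %d (prio %d)\n", i, prio[i]);
            return 0;
    }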
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
bad_swap:
if (bdev)
blkdev_put(bdev, BDEV_SWAP);
bad_swap_2:
swap_list_lock();
swap_map = p->swap_map;
nd.mnt = p->swap_vfsmnt;
nd.dentry = p->swap_file;
p->swap_device = 0;
p->swap_file = NULL;
p->swap_vfsmnt = NULL;
p->swap_map = NULL;
p->flags = 0;
if (!(swap_flags & SWAP_FLAG_PREFER))
++least_priority;
swap_list_unlock();
if (swap_map)
vfree(swap_map);
path_release(&nd);
out:
if (swap_header)
free_page((long) swap_header);
unlock_kernel();
return error;
}
1087-1088 Drop the reference to the block device
1090-1104 This is the error path where the swap list needs to be unlocked, the slot
in swap_info reset to being unused and the memory allocated for swap_map
freed if it was assigned
1104 Drop the reference to the special file
1106-1107 Release the page containing the swap header information as it is no
longer needed
1108 Drop the Big Kernel Lock
1109 Return the error or success value
K.4.2 Function: swap_setup() (mm/swap.c)

This function is called during the initialisation of kswapd to set the size of
page_cluster. This variable determines how many pages to read ahead from files
and from backing storage when paging in data.
100 void __init swap_setup(void)
101 {
102     unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
103
104     /* Use a smaller cluster for small-memory machines */
105     if (megs < 16)
106         page_cluster = 2;
107     else
108         page_cluster = 3;
109     /*
110      * Right now other parts of the system means that we
111      * _really_ don't want to cluster much more
112      */
113 }
102 Calculate how much memory the system has in megabytes
105 In low memory systems, set page_cluster to 2 which means that, at most, 4
pages will be paged in from disk during readahead
108 Else readahead 8 pages
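Because the readahead window is 2^page_cluster pages, the mapping from
memory size to readahead can be shown with a few lines of arithmetic. This is a
hypothetical stand-alone sketch; swap_setup() itself obviously reads
num_physpages inside the kernel.

    /* page_cluster_demo.c - show how the page_cluster value chosen by
     * swap_setup() translates into a readahead window of 1 << page_cluster
     * pages.  The memory sizes below are made-up inputs for illustration.
     */
    #include <stdio.h>

    static int choose_page_cluster(unsigned long megs)
    {
            return megs < 16 ? 2 : 3;       /* same policy as swap_setup() */
    }

    int main(void)
    {
            unsigned long sizes[] = { 8, 16, 64 };

            for (int i = 0; i < 3; i++) {
                    int cluster = choose_page_cluster(sizes[i]);
                    printf("%3lu MiB -> page_cluster=%d -> readahead %d pages\n",
                           sizes[i], cluster, 1 << cluster);
            }
            return 0;
    }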
K.5 Deactivating a Swap Area

Contents
K.5 Deactivating a Swap Area                                              606
    K.5.1 Function: sys_swapoff()                                         606
    K.5.2 Function: try_to_unuse()                                        610
    K.5.3 Function: unuse_process()                                       615
    K.5.4 Function: unuse_vma()                                           616
    K.5.5 Function: unuse_pgd()                                           616
    K.5.6 Function: unuse_pmd()                                           618
    K.5.7 Function: unuse_pte()                                           619

K.5.1 Function: sys_swapoff() (mm/swapfile.c)

This function is principally concerned with updating the swap_info_struct and
the swap lists. The main task of paging in all pages in the area is the responsibility
of try_to_unuse(). The function tasks are broadly:

•   Call user_path_walk() to acquire the information about the special file to be
    deactivated and then take the BKL

•   Remove the swap_info_struct from the swap list and update the global
    statistics on the number of swap pages available (nr_swap_pages) and the
    total number of swap entries (total_swap_pages). Once this is acquired, the
    BKL can be released again

•   Call try_to_unuse() which will page in all pages from the swap area to be
    deactivated

•   If there was not enough available memory to page in all the entries, the swap
    area is reinserted back into the running system as it cannot be simply dropped.
    If it succeeded, the swap_info_struct is placed into an uninitialised state and
    the swap_map memory freed with vfree()
720 asmlinkage long sys_swapoff(const char * specialfile)
721 {
722
struct swap_info_struct * p = NULL;
723
unsigned short *swap_map;
724
struct nameidata nd;
725
int i, type, prev;
726
int err;
727
728
if (!capable(CAP_SYS_ADMIN))
729
return -EPERM;
730
731
err = user_path_walk(specialfile, &nd);
732
if (err)
K.5 Deactivating a Swap Area (sys_swapoff())
733
734
607
goto out;
728-729 Only the superuser or a process with CAP_SYS_ADMIN capabilities may
deactivate an area
731-732 Acquire information about the special file representing the swap area with
user_path_walk(). Goto out if an error occurred
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
lock_kernel();
prev = -1;
swap_list_lock();
for (type = swap_list.head; type >= 0;
type = swap_info[type].next) {
p = swap_info + type;
if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
if (p->swap_file == nd.dentry)
break;
}
prev = type;
}
err = -EINVAL;
if (type < 0) {
swap_list_unlock();
goto out_dput;
}
if (prev < 0) {
swap_list.head = p->next;
} else {
swap_info[prev].next = p->next;
}
if (type == swap_list.next) {
/* just pick something that's safe... */
swap_list.next = swap_list.head;
}
nr_swap_pages -= p->pages;
total_swap_pages -= p->pages;
p->flags = SWP_USED;
Acquire the BKL, find the swap_info_struct for the area to be deactivated and
remove it from the swap list.

735 Acquire the BKL
737 Lock the swap list
738-745 Traverse the swap list and find the swap_info_struct for the requested
area. Use the dentry to identify the area
747-750 If the struct could not be found, return
752-760 Remove from the swap list making sure that this is not the head
761 Update the total number of free swap slots
762 Update the total number of existing swap slots
763 Mark the area as active but may not be written to
764     swap_list_unlock();
765     unlock_kernel();
766     err = try_to_unuse(type);
764 Unlock the swap list
765 Release the BKL
766 Page in all pages from this swap area
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
lock_kernel();
if (err) {
/* re-insert swap space back into swap_list */
swap_list_lock();
for (prev = -1, i = swap_list.head;
i >= 0;
prev = i, i = swap_info[i].next)
if (p->prio >= swap_info[i].prio)
break;
p->next = i;
if (prev < 0)
swap_list.head = swap_list.next = p - swap_info;
else
swap_info[prev].next = p - swap_info;
nr_swap_pages += p->pages;
total_swap_pages += p->pages;
p->flags = SWP_WRITEOK;
swap_list_unlock();
goto out_dput;
}
Acquire the BKL. If we failed to page in all pages, then reinsert the area into
the swap list

767 Acquire the BKL
770 Lock the swap list
771-778 Reinsert the area into the swap list. The position it is inserted at depends
on the swap area priority
779-780 Update the global statistics
781 Mark the area as safe to write to again
782-783 Unlock the swap list and return
785
if (p->swap_device)
786
blkdev_put(p->swap_file->d_inode->i_bdev, BDEV_SWAP);
787
path_release(&nd);
788
789
swap_list_lock();
790
swap_device_lock(p);
791
nd.mnt = p->swap_vfsmnt;
792
nd.dentry = p->swap_file;
793
p->swap_vfsmnt = NULL;
794
p->swap_file = NULL;
795
p->swap_device = 0;
796
p->max = 0;
797
swap_map = p->swap_map;
798
p->swap_map = NULL;
799
p->flags = 0;
800
swap_device_unlock(p);
801
swap_list_unlock();
802
vfree(swap_map);
803
err = 0;
804
805 out_dput:
806
unlock_kernel();
807
path_release(&nd);
808 out:
809
return err;
810 }
Else the swap area was successfully deactivated, so close the block device and
mark the swap_info_struct free

785-786 Close the block device
787 Release the path information
789-790 Acquire the swap list and swap device lock
791-799 Reset the fields in swap_info_struct to default values
800-801 Release the swap list and swap device
802 Free the memory used for the swap_map
806 Release the BKL
807 Release the path information in the event we reached here via the error path
809 Return success or failure
K.5.2 Function: try_to_unuse() (mm/swapfile.c)

This function is heavily commented in the source code, albeit the commentary
consists of speculation or is slightly inaccurate in parts. The comments are omitted
here for brevity.
513 static int try_to_unuse(unsigned int type)
514 {
515
struct swap_info_struct * si = &swap_info[type];
516
struct mm_struct *start_mm;
517
unsigned short *swap_map;
518
unsigned short swcount;
519
struct page *page;
520
swp_entry_t entry;
521
int i = 0;
522
int retval = 0;
523
int reset_overflow = 0;
525
540
start_mm = &init_mm;
541
atomic_inc(&init_mm.mm_users);
542
540-541 The starting mm_struct to page in pages for is init_mm. The count is
incremented even though this particular struct will not disappear to prevent
having to write special cases in the remainder of the function
556
557
558
559
560
561
562
563
564
565
572
573
while ((i = find_next_to_unuse(si, i))) {
/*
* Get a page for the entry, using the existing swap
* cache page if there is one. Otherwise, get a clean
* page and read the swap into it.
*/
swap_map = &si->swap_map[i];
entry = SWP_ENTRY(type, i);
page = read_swap_cache_async(entry);
if (!page) {
if (!*swap_map)
continue;
K.5 Deactivating a Swap Area (try_to_unuse())
574
575
576
577
578
579
580
581
582
583
584
585
retval = -ENOMEM;
break;
}
/*
* Don't hold on to start_mm if it looks like exiting.
*/
if (atomic_read(&start_mm->mm_users) == 1) {
mmput(start_mm);
start_mm = &init_mm;
atomic_inc(&init_mm.mm_users);
}
556 This is the beginning of the major loop in this function. Starting from the
beginning of the swap_map, it searches for the next entry to be freed with
find_next_to_unuse() until all swap map entries have been paged in
562-564 Get the swp_entry_t and call read_swap_cache_async() (See Section K.3.1.1)
to find the page in the swap cache or have a new page allocated for reading in
from the disk
565-576 If we failed to get the page, it means the slot has already been freed
independently by another process or thread (the process could be exiting elsewhere)
or we are out of memory. If independently freed, we continue to the next map,
else we return -ENOMEM
581 Check to make sure this mm is not exiting. If it is, decrement its count and
go back to init_mm
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
/*
* Wait for and lock page. When do_swap_page races with
* try_to_unuse, do_swap_page can handle the fault much
* faster than try_to_unuse can locate the entry. This
* apparently redundant "wait_on_page" lets try_to_unuse
* defer to do_swap_page in such a case - in some tests,
* do_swap_page and try_to_unuse repeatedly compete.
*/
wait_on_page(page);
lock_page(page);
/*
* Remove all references to entry, without blocking.
* Whenever we reach init_mm, there's no address space
* to search, but use it as a reminder to search shmem.
*/
shmem = 0;
K.5 Deactivating a Swap Area (try_to_unuse())
604
605
606
607
608
609
610
611
612
swcount = *swap_map;
if (swcount > 1) {
flush_page_to_ram(page);
if (start_mm == &init_mm)
shmem = shmem_unuse(entry, page);
else
unuse_process(start_mm, entry, page);
}
595 Wait on the page to complete IO. Once it returns, we know for a fact the page
exists in memory with the same information as that on disk
596 Lock the page
604 Get the swap map reference count
605 If the count is positive then...
606 As the page is about to be inserted into process page tables, it must be flushed from the D-Cache or the process may not see changes made to the page by the kernel
607-608 If we are using the init_mm, call shmem_unuse() (See Section L.6.2) which will free the page from any shared memory regions that are in use
610 Else update the PTE in the current mm which references this page
612         if (*swap_map > 1) {
613             int set_start_mm = (*swap_map >= swcount);
614             struct list_head *p = &start_mm->mmlist;
615             struct mm_struct *new_start_mm = start_mm;
616             struct mm_struct *mm;
617
618             spin_lock(&mmlist_lock);
619             while (*swap_map > 1 &&
620                    (p = p->next) != &start_mm->mmlist) {
621                 mm = list_entry(p, struct mm_struct,
                                    mmlist);
622                 swcount = *swap_map;
623                 if (mm == &init_mm) {
624                     set_start_mm = 1;
625                     spin_unlock(&mmlist_lock);
626                     shmem = shmem_unuse(entry, page);
627                     spin_lock(&mmlist_lock);
628                 } else
629                     unuse_process(mm, entry, page);
630                 if (set_start_mm && *swap_map < swcount) {
631                     new_start_mm = mm;
632                     set_start_mm = 0;
633                 }
634             }
635             atomic_inc(&new_start_mm->mm_users);
636             spin_unlock(&mmlist_lock);
637             mmput(start_mm);
638             start_mm = new_start_mm;
639         }
612-637 If an entry still exists, begin traversing through all mm_structs finding references to this page and update the respective PTE
618 Lock the mm list
619-632 Keep searching until all mm_structs have been found. Do not traverse the full list more than once
621 Get the mm_struct for this list entry
623-627 Call shmem_unuse() (See Section L.6.2) if the mm is init_mm as that indicates that it is a page from the virtual filesystem. Else call unuse_process() (See Section K.5.3) to traverse the current process's page tables searching for the swap entry. If found, the entry will be freed and the page reinstantiated in the PTE
630-633 Record if we need to start searching mm_structs starting from init_mm again
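
The traversal at lines 619-634 works because every mm_struct in the system is threaded onto a single circular list through its mmlist field, so stepping p = p->next from start_mm and converting each node back to its container visits each address space exactly once. A stripped-down userspace model of that list_entry() idiom follows; the types are simplified stand-ins, not the kernel's list implementation.

#include <stddef.h>

/* Minimal circular doubly-linked list node, as in the kernel's list_head. */
struct list_node {
        struct list_node *next, *prev;
};

/* Convert a pointer to an embedded list node back to its container. */
#define node_entry(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct mm_model {
        int id;
        struct list_node mmlist;
};

static void visit_other_mms(struct mm_model *start,
                            void (*visit)(struct mm_model *))
{
        struct list_node *p = &start->mmlist;

        /* Walk the ring until we arrive back at the starting node. */
        while ((p = p->next) != &start->mmlist)
                visit(node_entry(p, struct mm_model, mmlist));
}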
654         if (*swap_map == SWAP_MAP_MAX) {
655             swap_list_lock();
656             swap_device_lock(si);
657             nr_swap_pages++;
658             *swap_map = 1;
659             swap_device_unlock(si);
660             swap_list_unlock();
661             reset_overflow = 1;
662         }
654 If the swap map entry is permanently mapped, we have to hope that all processes have their PTEs updated to point to the page so that, in effect, the swap map entry is free. In practice, it is highly unlikely a slot would be permanently reserved in the first place
655-661 Lock the list and swap device, set the swap map entry to 1, unlock them again and record that a reset overflow occurred
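
The reason a slot can be "permanently" reserved at all is that the per-slot count in swap_map is a small saturating counter: once it reaches SWAP_MAP_MAX the true number of users has been lost, so the count is never adjusted again by ordinary reference and release operations. The sketch below is a simplified model of that saturating behaviour, not the exact logic of swap_duplicate() and swap_free().

#define MODEL_SWAP_MAP_MAX 0x7fff       /* maximum shown for illustration */

/* Take another reference to a swap slot, saturating at the maximum. */
static void model_swap_duplicate(unsigned short *count)
{
        if (*count < MODEL_SWAP_MAP_MAX)
                (*count)++;
        /* else: saturated, the count is now "sticky" and stays at the max */
}

/* Drop a reference; a saturated count is never decremented because the
 * real number of users is unknown. */
static void model_swap_free(unsigned short *count)
{
        if (*count && *count < MODEL_SWAP_MAP_MAX)
                (*count)--;
}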
683         if ((*swap_map > 1) && PageDirty(page) &&
                PageSwapCache(page)) {
684             rw_swap_page(WRITE, page);
685             lock_page(page);
686         }
687         if (PageSwapCache(page)) {
688             if (shmem)
689                 swap_duplicate(entry);
690             else
691                 delete_from_swap_cache(page);
692         }
683-686 In the very rare event a reference still exists to the page, write the page back to disk so at least if another process really has a reference to it, it'll copy the page back in from disk correctly
687-689 If the page is in the swap cache and belongs to the shared memory filesystem, a new reference is taken to it with swap_duplicate() so we can try and remove it again later with shmem_unuse()
691 Else, for normal pages, just delete them from the swap cache
699         SetPageDirty(page);
700         UnlockPage(page);
701         page_cache_release(page);
699 Mark the page dirty so that the swap out code will preserve the page and if it
needs to remove it again, it'll write it correctly to a new swap area
700 Unlock the page
701 Release our reference to it in the page cache
708         if (current->need_resched)
709             schedule();
710     }
711
712     mmput(start_mm);
713     if (reset_overflow) {
714         printk(KERN_WARNING "swapoff: cleared swap entry overflow\n");
715         swap_overflow = 0;
716     }
717     return retval;
718 }
708-709 Call schedule() if necessary so the deactivation of swap does not hog the entire CPU
712 Drop our reference to the mm
713-716 If a permanently mapped page had to be removed, then print out a warning so that in the very unlikely event an error occurs later, there will be a hint to what might have happened
717 Return success or failure
K.5.3  Function: unuse_process() (mm/swapfile.c)
This function begins the page table walk required to remove the requested page and entry from the process page tables managed by mm. This is only required when a swap area is being deactivated so, while expensive, it is a very rare operation. This set of functions should be instantly recognisable as a standard page-table walk.
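
For readers who have not seen one before, such a walk is just a descent through the levels of the page-table radix tree, which makes "visit every PTE in this mm" a set of nested loops. The sketch below is a userspace model with invented table sizes, intended only to show the shape that unuse_process() and its helpers follow.

/* Conceptual model of a three-level walk: PGD -> PMD -> PTE.
 * Sizes and types are invented for illustration. */
#define MODEL_PTRS_PER_PGD 4
#define MODEL_PTRS_PER_PMD 4
#define MODEL_PTRS_PER_PTE 8

typedef unsigned long pte_model_t;

struct pmd_model { pte_model_t *pte_table[MODEL_PTRS_PER_PMD]; };
struct pgd_model { struct pmd_model *pmd_table[MODEL_PTRS_PER_PGD]; };

/* Count how many PTEs hold a given value, visiting every mapped entry. */
static int count_matching_ptes(struct pgd_model *pgd, pte_model_t target)
{
        int i, j, k, matches = 0;

        for (i = 0; i < MODEL_PTRS_PER_PGD; i++) {              /* walk the PGD */
                struct pmd_model *pmd = pgd->pmd_table[i];

                if (!pmd)
                        continue;
                for (j = 0; j < MODEL_PTRS_PER_PMD; j++) {      /* walk each PMD */
                        pte_model_t *ptes = pmd->pte_table[j];

                        if (!ptes)
                                continue;
                        for (k = 0; k < MODEL_PTRS_PER_PTE; k++) /* walk the PTEs */
                                if (ptes[k] == target)
                                        matches++;
                }
        }
        return matches;
}

The real walk, beginning with unuse_process() below, has the same shape but holds the page table lock and examines each PTE in place.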
454 static void unuse_process(struct mm_struct * mm,
455                           swp_entry_t entry, struct page* page)
456 {
457     struct vm_area_struct* vma;
458
459     /*
460      * Go through process' page directory.
461      */
462     spin_lock(&mm->page_table_lock);
463     for (vma = mm->mmap; vma; vma = vma->vm_next) {
464         pgd_t * pgd = pgd_offset(mm, vma->vm_start);