Experimental data on page replacement algorithm
by N. A. OLIVER
General Motors Research Laboratories
Warren, Michigan
INTRODUCTION
Although paged VM (Virtual Memory) systems are being implemented more and more, their full capabilities have not yet been realized. Early research in this field pointed to possible inefficiencies in their implementation.1-3 Subsequent studies, however, led to the conclusion that paged VM systems could provide a productive means to run large programs on small main memory, if proper techniques are employed.4-7 One of the most influential of these is the choice of an efficient page replacement algorithm (RA) to minimize page traffic between the different levels of memory.

This paper compares the performance of two RAs about which little system performance measurement data is available. They are: the Global Least Recently Used (LRU) and the Local LRU with a fixed and equal size main memory buffer allotted to each task. The number of page faults caused during execution of programs under each RA is used as an inverse criterion for its effectiveness.

These studies were conducted at the General Motors Research Laboratories (GMR) on the CDC STAR-1B* Virtual Memory computer8 (core size = 65K of 64-bit words; auxiliary/main memory access time ratio of 50000) with the Multi-Console-Time-Sharing (MCTS) operating system.9

* STAR-1B is a microprogrammed prototype version of the STAR-100 CDC computer.
PAGE REPLACEMENT ALGORITHMS
A basic problem in paged VM systems is deciding which page should be removed from main memory when an additional page of information is needed. Obviously, it should be a page with the least likelihood of being needed in the near future. Therefore a simple criterion for the "goodness" of a page RA is the minimization of page traffic between the main and auxiliary memories, which is measured by the number of page faults that occur during program execution.

One of the most popular page replacement strategies is the LRU (Least Recently Used) strategy. The following RAs are based on it (a simulation sketch contrasting the first two appears after this list):

1. Global LRU RA: The replaced page is the one that has not been referenced for the longest period of real time, regardless of the task to which it belongs. This RA, which is a varying partitions RA by default, is heavily considered in literature.5,10,11 Comparison results with various RAs obtained via simulation techniques and interpretive execution1,12,13 are available. However, few if any (non-simulation) system measurements have been conducted. Existing (and fully developed) VM operating systems utilizing variations of this RA known to the author are: CP/67,14,15 Multics,14,16 MTS,14 VS1,17 and VS2.18

2. Local LRU with fixed main memory paging buffer per task RA: The least recently used selection is made from pages belonging to the task which generated the page fault. Some treatment1,4,10,13,19,20 and measurements of this RA were found in literature. However, only one operating system (besides the interim version of MCTS) implements a remote variation of this RA. It is the original IBM version of TSS.14,21

3. Local LRU with varying (working set)5 partitions RA (WSRA): The replaced page is the least recently used page which does not belong to the working set of any task. Extensive literature is available.4,5,6,7,11,13,19,20,22,23,24,25 Two true implementations (Burroughs B6700,26 and CP/67 at IRIA, France26) and one approximation (the current version of TSS27) of this RA are known, with limited measurement results.

Due to the implementation difficulties of the WSRA, only limited (special-case) measurements were taken of it.
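To make the contrast between the first two policies concrete, here is a minimal, hypothetical Python sketch of the page-fault-counting criterion used throughout this paper. It is not the MCTS code; the function name, the OrderedDict representation, and the synthetic reference string are assumptions for illustration only. Under "global", one LRU list covers all frames; under "local", each task receives a fixed, equal share of the frames and victims are chosen only from the faulting task's own pages.

    from collections import OrderedDict

    def count_faults(refs, total_frames, policy):
        """Count page faults for a reference string of (task, page) pairs.
        policy == "global": one LRU list over all frames (Global LRU RA).
        policy == "local":  fixed, equal frame quota per task; the victim is
        the LRU page of the faulting task only (Local LRU RA)."""
        tasks = sorted({t for t, _ in refs})
        faults = 0
        if policy == "global":
            resident = OrderedDict()                    # most recently used entry kept at the end
            for task, page in refs:
                key = (task, page)                      # no page sharing between tasks
                if key in resident:
                    resident.move_to_end(key)
                else:
                    faults += 1
                    if len(resident) >= total_frames:
                        resident.popitem(last=False)    # evict the globally least recently used page
                    resident[key] = True
        else:
            quota = total_frames // len(tasks)          # fixed and equal partition per task
            resident = {t: OrderedDict() for t in tasks}
            for task, page in refs:
                buf = resident[task]
                if page in buf:
                    buf.move_to_end(page)
                else:
                    faults += 1
                    if len(buf) >= quota:
                        buf.popitem(last=False)         # evict the LRU page of the faulting task
                    buf[page] = True
        return faults

    # Illustrative use: two tasks cycling over six pages each in 8 frames.
    refs = [(t, p) for _ in range(2) for p in range(6) for t in (1, 2)]
    print(count_faults(refs, 8, "global"), count_faults(refs, 8, "local"))
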
THE TESTING ENVIRONMENT
System characteristics
Page-table: The STAR computer page-table28 provides an address translation mechanism for all memory references. It points to pages of main memory in use and provides the mapping between the virtual address and the physical location of a page. The page-table ordering is hardware maintained; its entries (one for each page) are LRU ordered. Thus, the most recently accessed pages migrate to the top of the table while the least recently used move to the bottom. (The difference in address translation time between top and
bottom entries of the page table, due to longer search time,
is insignificant relative to the other system time parameters.)
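As a rough software analogue of that hardware-ordered table (the class and method names below are illustrative, not STAR or MCTS identifiers), each successful translation moves the referenced entry to the top, so the bottom entry is always the least recently used page and is the natural replacement candidate:

    class LRUPageTable:
        """Toy model of an LRU-ordered page table: entry 0 is the most
        recently referenced page; the last entry is the LRU candidate."""

        def __init__(self):
            self.entries = []                  # (virtual_page, frame) pairs, MRU first

        def translate(self, virtual_page):
            for i, (vp, frame) in enumerate(self.entries):
                if vp == virtual_page:
                    # referenced entry migrates to the top of the table
                    self.entries.insert(0, self.entries.pop(i))
                    return frame
            return None                        # not resident: a page fault

        def load(self, virtual_page, frame):
            self.entries.insert(0, (virtual_page, frame))

        def evict_lru(self):
            return self.entries.pop()          # bottom entry = least recently used page
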
Level of multiprogramming: The MCTS operating system
can be multiprogrammed up to a level corresponding to the
maximum number of terminals supported by the system,
which is seven.
Scheduling: A round robin scheduling scheme among the multiprogrammed tasks is employed. Tasks which are in page or other I/O wait state are skipped. When a task's time-slice expires, that task is replaced by a task waiting for service, which is also chosen in a round robin fashion. If none is waiting, the task with the expired time-slice is allowed to continue. (In this study the level of multiprogramming is always equal to the number of running tasks and thus the time-slice parameter is not utilized.) A minimal sketch of this discipline is given at the end of this subsection.
Paging space: It includes a maximum of 92 pages. Each
page contains 512 64-bit words. On the interim MCTS
system, this space is equally divided among the multiprogrammed tasks.
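The following hypothetical sketch (field and function names are assumptions, not MCTS identifiers) illustrates the round robin discipline described above: it simply scans past tasks that are in page or other I/O wait.

    def next_task(tasks, current):
        """Round robin selection. 'tasks' is a list of objects with a boolean
        'waiting' attribute (True while in page or other I/O wait); 'current'
        is the index of the task whose time-slice just expired."""
        n = len(tasks)
        for step in range(1, n + 1):
            candidate = (current + step) % n
            if not tasks[candidate].waiting:   # skip tasks in page or other I/O wait
                return candidate
        return current                         # no one else is ready: current task continues
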
Paging mechanisms
In the interim version of MCTS, each task has a private
page-table. Depending on the paging space, each multiprogrammed task is allotted a fixed number of pages. The
Local LRU RA is used for page replacement.
For comparison purposes, MCTS was reprogrammed with
the Global LRU RA. Only the paging mechanism was
changed. No other system parameters such as multiprogramming, scheduling, paging space, etc., were modified.
The modifications for the Global LRU involved the use
of a single page-table for all the customer tasks. All available
pages in main memory were put into a general pool. When a
page fault occurred, the Global LRU page which was the
last entry in the hardware-managed single page-table, was
replaced. No sharing of pages was allowed.
TESTING TECHNIQUES
In an effort to choose typical and diverse applications, the following GMR-developed customer tasks were tested:
Malus-A compiler for a PL/I-like language designed to generate object code for the STAR computer. Two compilations, one of 185 and the other of 450 source lines, were measured.
OPL-A compiler for a computer graphics language. Again, two compilations, one of 160 and the other of 575 source lines, were examined.
INV-A matrix inversion routine. Measurements were
taken for inversion of a 200X200 order matrix.
LIST_CAT-A sorting routine. For this study a list of
700 names was sorted in several different orders.
PANICD-A compute-bound routine designed to format
MCTS core dumps for printing. It formatted about 50000
words for these tests.
The tests were performed by running identical and nonidentical tasks simultaneously from a varying number of
terminals. Each set of tests was executed twice; once with
the Global LRU system and then repeated with the Local
LRU system. The total paging space was held constant in
each case for both systems.
RESULTS
Identical-tasks test
Table I displays the average number of page faults per task (#P.F.) generated by identical multiprogrammed tasks running simultaneously on each of the two paging systems, for a varying number of terminals. Also included are ratio values representing the relative performance of the Local LRU in relation to the Global LRU. (The "extreme paging buffer" column will be explained later.)

TABLE I-Results of Identical-tasks Test

Customer tasks tested                   Number of   Global LRU   Local LRU   Local/Global   Extreme paging
                                        terminals   (#P.F.)      (#P.F.)     LRU            buffer
Malus compilation of 185 source lines       1             88          88        1.000
                                            2            245         488        1.991          23-68
                                            3            627        2241        3.574
                                            4           3765        4674        1.241
                                            5           6049        7863        1.299
                                            6           9348       13901        1.487
Malus compilation of 450 source lines       1            119         119        1.000
                                            2            428        4078        9.528          18-73
                                            3          10474       11943        1.140
OPL compilation of 160 source lines         1             86          86        1.000
                                            2            133         242        1.819          30-61
                                            3            446         751        1.683
OPL compilation of 575 source lines         1            143         143        1.000
                                            2            382         917        2.400          24-67
INV matrix inversion 200X200                1            103         103        1.000
                                            2          15524       15548        1.001          45-46
                                            3          15628       15688        1.001
LIST_CAT sorting routine                    1            104         104        1.000
                                            2            116         120        1.034          43-48
                                            3            125         190        1.520
PANICD dump formatting routine              1            448         448        1.000
                                            2            461         477        1.035          44-47
                                            3            490         498        1.016
                                            4            520         519        0.998
(The "extreme paging buffer" figures refer to the two-terminal run of each task.)

While experimenting with the Global LRU system, it was
observed that the number of pages used by each of the
simultaneously running tasks varied considerably during execution. On the other hand the number of pages used by
each task with the Local LRU system remained constant
(by design). For example, in the two terminal Malus compilation of 450 source lines (Table I), which displays the most
extreme difference, the Local LRU system divided the available 91 pages between the two tasks giving one 45 and the
other 46 pages. With the Global LRU, on the other hand,
tasks competed with each other for pages in main memory
and the number of pages which each "owned" as a function
of the elapsed execution time is displayed in Figure 1.
As these two identical tasks were started at the same
instant, one would tend to think that they would split the
core pages evenly among them and each would occupy close
to half of memory at any given time (as in the Local LRU
case). But as can be seen from Figure 1 this did not happen.
These tasks dynamically changed the number of pages which
they occupied with significant fluctuation.
One explanation might be that while executing, most
programs change their locality5 characteristics and consequently their working set size changes. If tasks are allowed
to compete for pages, they tend to accumulate as many
working set pages as they can in order to run effectively.
For this purpose they use pages obtained from other tasks
which at that moment (due to the negligible difference in
starting time) are executing at other stages of the same
program at which they usually have different locality properties and possibly require, or are forced to occupy, a smaller
working set of pages.
In addition, it was observed that even though the above two tasks started virtually at the same time, one task finished executing well ahead of the other. This can be
clearly observed in Figure 1. At the start of execution Task
#1 had 51 pages while Task #2 had only 40. Afterwards in
most cases Task #2 had more pages. At point A Task #2
completed its execution and its pages started migrating to
Task #1. At point B all of main memory belonged to Task
#1. Due to this page migration from one task to the other
and vice versa, Task #2 ran better up to point A and thus
Task #1 was able to run efficiently from point B to completion.
Figure 1-Variation in the number of pages "owned" by each of two tasks while executing under the Global LRU policy. (Vertical axis: N = number of pages used by task #1; the number of pages used by task #2 = 91 - N. Horizontal axis: elapsed execution time, sec.)
The elapsed execution time with the Global LRU system
was considerably shorter for high Local/Global ratios. But
whenever the ratio was close to one, this time was virtually
equal. As an example, in the case displayed in Figure 1 the
compilation under the Global LRU system lasted only 231
seconds while under the Local LRU system the same compilation took 746 seconds.
The compilation cases generated considerably more page-faults under the Local LRU RA. But the difference in number
of page-faults generated is less severe for LIST_CAT, and
there is almost no difference in the INV and PANICD cases.
This could be explained by the different working set characteristics of these programs. The Malus and OPL compilers
change their working set sizes dynamically at a high rate
while the rest of the tasks tend to have a fixed size or slowly
changing working sets of pages. An indication of the rate of
change of the working set size, in each case, for a multiprogramming level of two, can be obtained from the "extreme paging buffer" column in Table I. The figures in this
column represent the number of pages each of the two tasks
"owned" during the most extreme situation for that run
with the Global LRU RA. As the extreme buffer difference
gets larger, so does the performance ratio which is displayed
in the adjacent column of Table I.
Non-identical-tasks test
At this point we felt that although the Local LRU showed,
in some cases, poor performance when running the same
tasks one against the other, it might still be a useful tool to
prevent a complete system degradation in cases where some
of the running tasks are in a thrashing state while others are
not. We thought that by having a separate page table and
a fixed number of pages for each task, the thrashing task
would only degrade itself without affecting the rest of the
system.

TABLE II-Results of Non-identical-tasks Test

Customer tasks tested                    Global LRU   Local LRU   Local/Global
                                         (#P.F.)      (#P.F.)     LRU
Malus compilation of 450 source lines        1075         7239        6.734
OPL compilation of 575 source lines          1017         4271        4.200
INV matrix inversion 200X200                 1221         4822        3.949
PANICD dump formatting routine                519          557        1.073

To test this situation, a four non-identical task mixture
displayed in Table II was simultaneously run from four
terminals. This experiment was designed so that all the
tasks except PANICD would thrash with both the Global and Local LRU systems. Under the Local LRU every task "owned" one fourth of core while under the Global LRU all tasks compete for pages. Thus PANICD, which does not thrash when running with one-fourth of core, should have
benefited from the "protection" provided to its paging
buffer under the Local LRU algorithm whereas under the
Global LRU the thrashing tasks could affect its performance
by "taking away" its essential pages, because they are in great need of pages.
But as can be seen in Table II, the number of page-faults
which PANICD generated was not affected at all by the
thrashing tasks.
The explanation for these unexpected results might be that
tasks which are running effectively (PANICD), reference
frequently (and thus "protect") their slow changing working
set of pages. On the other hand the thrashing task needs
many pages, each page for a short interval, and does not
reference the same pages too often. Thus the pages of the
thrashing task are, in most cases, the least recently used
pages which migrate to the bottom of the page-table and
are consequently overwritten. The thrashing task is only
slightly affected by this process since chances are that it will
need many other pages before requiring the pages which were
just lost.
Working set replacement algorithm measurements

As we did not implement this RA, which implies the use of varying partitions with varying working-set sizes, we decided to test at least a special case of it: running a task with a fixed working set size in a fixed size partition. PANICD has a small and fixed size working set of pages. (It contains the procedure pages, one input and one output
data pages.) This fact explains the similar performance of
the Local and Global LRU for PANICD as shown in Tables
I and II. The slight difference in the number of page faults
is attributed to the execution of some system programs
known as "command language" before and after the actual
execution of PANICD. These programs require large and
rapidly changing working set sizes and thus account for the slightly lower number of page faults for the Global LRU in most cases.
In order to get relative performance measurements of the WSRA versus the Global LRU, we decided to eliminate the effect of the "command language" by initiating the measurements only after all the multiprogrammed PANICD tasks have started their actual execution and terminating the data collection just before the first PANICD task branches back to the "command language." Thus a fixed size working set was required by each PANICD task at any time.
Since the WSRA requires that each multiprogrammed
task have at least its working set of pages in main memory at
all times, a Local LRU level of multiprogramming which
provides a partition larger than the working set size will actually satisfy the WSRA requirements. The results for
running identical PANICD tasks from a different number
of terminals (corresponding to different levels of multiprogramming) are presented in Table III.
TABLE III-System Throughput and Page Fault Frequency for Increasing Levels of Multiprogramming

Number of                     Global LRU                      Local LRU (WSRA)
terminals             Dump units      #PF/dump        Dump units      #PF/dump
(multiprog. level)    proc./sec.      units proc.     proc./sec.      units proc.
        1                0.37            4.21             0.37            4.21
        2                0.42            4.22             0.42            4.22
        3                0.46            4.23             0.46            4.23
        4                0.46            4.24             0.46            4.23
        5                0.42            7.55             0.42            7.82
        6                0.39           11.90             0.40           10.19
        7                0.29           20.30             0.31           21.10

The column "dump units processed per second" presents the total number of dump-pages formatted by all the PANICD tasks, divided by the elapsed time required to run all the tasks at each multiprogramming level. Thus this column represents the real throughput of the entire system. The other column, "number of page faults per dump unit processed," shows the total number of page faults generated by all tasks, divided by the total of all the dump-pages formatted.

The throughput of the system increases with the level of multiprogramming, for both systems, up to the level of three and four while the number of page faults per unit dump remains low. Beyond the level of four both systems are thrashing and the throughput consequently deteriorates. Thus under the WSRA policy we would have run the system
at a multiprogramming level not higher than four. But for
levels one through four the Global and Local LRU, which in
this case is identical to the WSRA, perform virtually the
same.
Above the level of four the Local LRU (WSRA) does show better performance. That is probably because, under the Global LRU system, the pages of the task which has been waiting longest on the scheduler queue, and which is the next one to run, become the least recently used ones and are overwritten. As this happens only after reaching a thrashing
level of multiprogramming, a Global LRU RA can be useful
only if the system can detect when overloading occurs.
Such a performance monitor (based, for example, on the
page-faulting level of the entire system) could be used to
reduce the multiprogramming level whenever thrashing occurs to prevent performance degradation.
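Such a monitor was not part of the system measured here; a rough sketch of the idea, with assumed names and thresholds, might look as follows.

    def adjust_level(level, system_fault_rate, high=20.0, low=5.0, max_level=7):
        """Hypothetical load control: 'high' and 'low' are assumed page-fault-rate
        thresholds (faults per second), not values taken from this study."""
        if system_fault_rate > high and level > 1:
            return level - 1        # thrashing: deactivate one multiprogrammed task
        if system_fault_rate < low and level < max_level:
            return level + 1        # lightly loaded: admit another task
        return level
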
CONCLUSIONS
The results of this study strongly indicate that artificially
restricting the main memory space which a task may utilize
in a paged VM system results in increased page traffic
between the different levels of memory and consequently in
considerable loss of efficiency. Tasks, especially if they require rapidly changing working set sizes, should be allowed
to compete freely for the space which each may occupy at
any given time.
The Global LRU RA performed better than the Local
LRU RA with fixed partitions and matched the performance
of the Local LRU with varying partitions (WSRA), for a
non-thrashing situation, in this study. The following Global
LRU virtues should be noted:
1. It is a simple varying partitions RA in which the
partition size is controlled by the RA itself and not
by the operating system.
2. It is highly unlikely that thrashing tasks can "overtake" main memory and thus "hurt" the performance
of non-thrashing tasks. This is due to the fact that
non-thrashing tasks reference frequently and thus
"protect" their essential pages from becoming the
LRU ones. On the other hand the thrashing task
needs many pages, each for a short interval, and
does not reference the same pages too often. Thus the
pages of the thrashing task are, in most cases, the
LRU ones and are consequently overwritten.
3. Critics of the Global LRU strategy (including the author7) claim that with the Global LRU RA, the
task which is idle for the longest time while waiting
on the scheduler queue, and which is the next to run
is most likely to find its pages missing. Evidence of
this has been found in these studies. But it turns out
that the space could be utilized more effectively by
the current running task than it would have been if
these pages had been reserved without utilization for
the delayed task.
4. In addition to the performance advantage (reduction
of execution time and number of page faults), the
Global LRU with the single page-table is easier to
manage and requires less operating system space than
the Local LRU with multi-page-tables and fixed paging
buffer system.
The Global LRU algorithm is especially useful for simple
round robin scheduled operating systems. For more sophisticated systems, however, the multi-page-table approach might
be useful due to requirements other than efficiency, such as:
priority scheduling, etc. But since no artificial restrictions
should be imposed on main memory space it seems that the
working-set-partitions RA's such as the WSRA5 and PFF13
might well be the only class of paging strategies able to
perform effectively utilizing multi-page-tables. However, the
WSRA will require a "smart" mechanism to determine the
following: (A) The size of the task's working set at any given time. (B) When a task is in a working set expansion phase and needs more pages, which of the other multiprogrammed tasks will be the one to lose pages. (C) What
action should be taken when there are a few available pages
in core but not enough to start a new task. In the Global
LRU case no information about (A) is needed; the decision
about (B) is trivial; as for (C) the new task is started and
it "fights" to build its working set from pages which are
probably non-useful to other tasks.
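For requirement (A), the usual working-set definition5 is W(t, T), the set of distinct pages referenced during the last T time units. A minimal sketch of that estimate, with illustrative names and an assumed window value, is:

    def working_set(last_reference, now, window):
        """last_reference: dict mapping page -> time of its most recent reference.
        Returns W(now, window): the pages referenced within the trailing window."""
        return {page for page, t in last_reference.items() if now - t <= window}

    # Illustrative use: pages 3 and 9 were touched recently, page 1 was not.
    last_touch = {1: 100, 3: 940, 9: 990}
    print(working_set(last_touch, now=1000, window=100))   # the set {3, 9}
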
The WSRA has a clear advantage over the Global LRU: it prevents system overloading. Thus if the Global LRU is
to be used, a special system performance monitor could be
used to reduce the level of multiprogramming whenever overloading occurs.
ACKNOWLEDGMENT
I wish to thank G. G. Dodd for his support of this study and
advice on organizing the paper. Also I am grateful to R. R.
Brown, J. W. Boyse, M. Cianciolo and the rest of the MCTS
personnel for their cooperation. Last but not least I wish to
thank P. J. Denning and W. W. Chu for their constructive
criticism of this paper.
REFERENCES
1. Coffman, E. G. and L. C. Varian, "Further Experimental Data on the Behavior of Programs in a Paging Environment," Comm. ACM, Vol. 11, July 1968, pp. 471-474.
2. Fine, G. H., C. W. Jackson and P. V. McIsaac, "Dynamic Program Behavior under Paging," Proc. 21st Nat. Conf. ACM, ACM Pub. P-66, 1966, pp. 223-228.
3. Kuehner, C. J. and B. Randell, "Demand Paging in Perspective," Proc. AFIPS 1968 Fall Joint Comp. Conf., Vol. 33, pp. 1011-1018.
4. Denning, P. J., "Virtual Memory," Computing Surveys, Vol. 2, No. 3, Sept. 1970, pp. 153-189.
5. Denning, P. J., "The Working-set Model for Program Behavior," Comm. ACM, Vol. 11, May 1968, pp. 323-333.
6. Chu, W. W., N. Oliver, and H. Opderbeck, "Measurement Data on the Working Set Replacement Algorithm and Their Applications," Proc. Brooklyn Polytechnic Institute Symposium on Computer-Communications Networks and Teletraffic, Vol. 22, Apr. 1972.
7. Oliver, N., Optimization of Virtual Paged Memories, Master's thesis, Univ. of Calif., Los Angeles, 1971.
8. Holland, S. A. and C. L. Purcel, "The CDC STAR-100, a Large Scale Network Oriented Computer System," IEEE Proc. of the International Computer Society Conference, Boston, Mass., Sep. 22-24, 1971.
9. Brown, R. R., J. L. Elshoff, M. R. Ward, et al., Collection of MCTS Papers, to be published, G. M. Res. Labs., Warren, Mich., 1974.
10. Belady, L. A., "A Study of Replacement Algorithms for a Virtual-storage Computer," IBM Syst. J., Vol. 5, No. 2, 1966, pp. 78-101.
11. Denning, P. J., "Thrashing: Its Causes and Prevention," Proc. AFIPS 1968 Fall Joint Comp. Conf., Vol. 33, pp. 915-922.
12. Thorington, J. M. and J. D. Irvin, "An Adaptive Replacement Algorithm for Paged-memory Computer Systems," IEEE Trans., Vol. C-21, Oct. 1972, pp. 1053-1061.
13. Chu, W. W. and H. Opderbeck, "The Page Fault Frequency Replacement Algorithm," Proc. AFIPS 1972 FJCC, Vol. 41, pp. 597-609.
14. Alexander, M. T., Time Sharing Supervisor Program, Univ. of Mich. Computing Center, May 1969.
15. Bayels, R. A., et al., Control Program-67/Cambridge Monitor System (CP-67/CMS), Program Number 360D 05.2.005, Cambridge, Mass., 1968.
16. Organick, E. I., A Guide to Multics for Sub-System Writers, Project MAC, 1969.
17. IBM OS/Virtual Storage 1 Features Supplement, No. GC20-1752-0.
18. IBM OS/Virtual Storage 2 Features Supplement, No. GC20-1753-0.
19. Coffman, E. G. and T. A. Ryan, "A Study of Storage Partitioning Using a Mathematical Model of Locality," Comm. ACM, Vol. 15, March 1972, pp. 185-190.
20. Oden, P. H. and G. S. Shedler, A Model of Memory Contention in a Paging Machine, IBM Res. Tech. Rep. RC3056, IBM Yorktown Heights, N. Y., Sept. 1970.
21. IBM System/360 Time Sharing Operating System Program Logic Manual, File No. S360-36, GY28-2009-2, New York, 1970.
22. Denning, P. J. and S. C. Schwartz, "Properties of the Working-Set Model," Comm. ACM, Vol. 15, March 1972, pp. 191-198.
23. DeMeis, W. M. and N. Weizer, "Measurement Data Analysis of a Demand Paging Time Sharing System," ACM Proc. 1969, pp. 201-216.
24. Oppenheimer, G. and N. Weizer, "Resource Management for a Medium Scale Time-sharing System," Comm. ACM, Vol. 11, May 1968, pp. 313-322.
25. Spirn, J. R. and P. J. Denning, "Experiments with Program Locality," Proc. AFIPS 1972 FJCC, Vol. 41, pp. 611-621.
26. Private communication with P. J. Denning.
27. Doherty, W. J., "Scheduling TSS/360 for Responsiveness," Proc. AFIPS 1970 FJCC, Vol. 37, AFIPS Press, Montvale, N. J., pp. 97-112.
28. Curtis, R. L., "Management of High Speed Memory in the STAR-100 Computer," IEEE Proc. of the International Computer Society Conference, Boston, Mass., Sep. 22-24, 1971.