
CSE 221 Project – Winter 2003
System Performance Measurements
on
RedHat Linux 8.0 and Windows XP 5.1
Project Members:
Urvashi Rao Venkata
Reena Mathew
March 14, 2003
Table of Contents
1 Introduction
2 Project Overview
3 Memory
3.1 Cache Vs RAM Vs Swap Disk
3.2 Code/Memory Optimizations
3.3 Looping and Sub-Looping
4 Disk IO
4.1 Disk Access (read-write-read)
4.2 Sequential Vs Random Reads
4.3 Read Vs mmap
4.4 Remote File Access
5 Processes
5.1 System Calls
5.2 Function Performance with Different Implementation Styles
6 Arithmetic Operations
6.1 Integer and Float Operations
7 Linux Vs Windows
7.1 Memory
7.2 Disk IO
7.3 Process
8 Conclusion
Appendix 1
References
Table of Figures
Figure 1 : Code Segment for Memory Access
Figure 2 : Graphs for Memory Access
Figure 3 : Graph for Memory Access for Code plus Data
Figure 4 : Code Segment for Looping and Sub-Looping
Figure 5 : Graphs for Sub-Looping with 8192KB on Linux
Figure 6 : Graphs for Sub-Looping with 8192KB on Windows
Figure 7 : Code Segment for Disk Access using 'fread' and 'fwrite'
Figure 8 : Graphs for Disk Read And Write Access on Linux
Figure 9 : Graphs for Disk Read And Write Access on Windows
Figure 10 : Code Segment for Disk Access using 'open' and 'close'
Figure 11 : Graphs for Disk Read And Write Access on Linux
Figure 12 : Graphs for Disk Read And Write Access on Windows
Figure 13 : Code Segment for Disk Access using 'mmap' and 'munmap'
Figure 14 : Graph for Disk Access using 'mmap' and 'munmap'
Figure 15 : Code Segment for Sequential and Random Disk Access
Figure 16 : Graphs for Sequential and Random Disk Access on Linux and Windows
Figure 17 : Graphs for Sequential and Random Disk Access on Linux to determine Page Read Ahead Size
Figure 18 : Graph for Sequential and Random Disk Access on Windows to determine Page Read Ahead Size
Figure 19 : Code Segment for 'read' Vs 'mmap'
Figure 20 : Graphs for Disk Access using 'read' and 'mmap'
Figure 21 : Individual graphs for 'mmap' and normal 'read-write'
Figure 22 : Table for read and mmap timings
Figure 23 : Code Segment for Local and Remote Disk Access
Figure 24 : Graphs for Local and Remote Disk Access
Figure 25 : Code Segment for System Calls
Figure 26 : Table of performance of system calls
Figure 27 : Code Segment for different call types
Figure 28 : Graphs for Inline Vs Function Vs Recursion calls on Linux
Figure 29 : Graph for Inline Vs Function Vs Recursion on Windows
Figure 30 : Graphs for Forks
Figure 31 : Code Segment for Arithmetic Operations
Figure 32 : Graphs for Arithmetic Operations
Figure 33 : Graphs comparing Linux and Windows memory access bandwidths
Figure 34 : Graphs for Code plus data with different levels of optimization on Linux
Figure 35 : Graphs for Code plus data with different levels of optimization on Windows
Figure 36 : Graphs comparing Linux and Windows with O3 optimization level
Figure 37 : Graphs comparing the Sub-Looping between Linux and Windows
Figure 38 : Graphs comparing disk read bandwidths for Linux and Windows
Figure 39 : Graphs comparing disk write bandwidths for Linux and Windows
Figure 40 : Graphs comparing Sequential and Random disk access between Linux and Windows
Figure 41 : Graphs for comparing the disk accesses for Page Read Ahead between Linux and Windows
Figure 42 : Graphs comparing the times for different call types on Linux and Windows
1 Introduction
This project measures the performance of an operating system (OS) on a given hardware platform. It defines a set of metrics that can be used to evaluate the performance of selected features of an operating system. Measurements were made on two operating systems, RedHat Linux 8.0 and Windows XP 5.1, running on the same hardware. The results for each operating system were evaluated and then compared between the two, and observations and conclusions are discussed for each type of measurement.
2 Project Overview
The goal of the project has been to evaluate the performance of two different operating systems on the same
hardware. The metrics chosen fall into the categories of memory access, disk access, processes and arithmetic
operations.
Architecture
Hardware – Dell Inspiron 2650
Processor – Intel(R) Pentium(R) 4 CPU 1.70 GHz
Cache Size – 512 KB
RAM Size – 256 MB
Hard Disk – 30 GB
Operating Systems
(i) Redhat Linux Version 8.0
(ii) Microsoft Windows XP Version 5.1.2600 Service Pack 1.0
- Cygwin 1.3.20-1 (UNIX environment for Windows)
Compiler
gcc3.2 – C compiler (on Linux and Cygwin)
Benchmarking and performance measurement correspond to sampling a multidimensional space spanned by non-orthogonal axes: measurements are rarely completely isolated from the effects of related factors. Wherever possible, we have designed our metrics to minimize this interference and to emphasize or isolate the trends corresponding to the aspect being measured. Elsewhere, we have attempted analyses that take into account the major factors affecting the quantity being measured.
We have measured performance parameters and tried to verify our results against documented features of the OS (for example, kernel buffer sizes) to see whether these specifications are actually 'visible' from the application level. Rather than aiming for absolute numbers, our tests attempt to highlight how the OS manipulates the hardware when different code implementations are used. These observations can inform efficient software design: algorithms can be structured to exploit the behaviour of the OS, or equivalently, to avoid common pitfalls.
All tests were run with compiler optimization (O3). Wherever we observed anomalies, we re-ran the tests without optimization to check whether the anomalies were caused by it. The rationale is that we wanted to measure performance under the best possible conditions (i.e. with optimization). We are aware that the effect of compiler optimization will differ between our test code and any application designed with our results in mind, and we have tried to reduce this effect by measuring relative performance. Absolute measurements were made only for computations such as memory and disk accesses and atomic arithmetic operations, which were compiled with optimization to obtain figures for best-case performance.
For all tests, the same code was used on Linux and Windows. The Cygwin interface was used to translate the Unix kernel calls into nearly equivalent functionality on Windows. This does not guarantee an exact functional emulation of the Linux test code, but it is the closest we could get.
Note: In the rest of the report, the terms ‘Linux’ and ‘Windows’ correspond to the specifications mentioned above.
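The code segments in the following sections rely on a GetTime() helper that is not listed in this report. A minimal sketch of such a timer, assuming a gettimeofday-based wall clock and the calling convention visible in the code segments (mode 1 starts a timer, mode 2 returns the seconds elapsed since the last start, mode 3 returns the current time in seconds), is shown below.

/* Sketch of the GetTime() helper assumed by the code segments (an
   assumption, not the report's actual implementation).              */
#include <stddef.h>
#include <sys/time.h>

static double start_time = 0.0;

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);              /* microsecond wall clock */
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

double GetTime(int mode)
{
    double t = now_seconds();
    if (mode == 1) { start_time = t; return 0.0; }   /* start a timer    */
    if (mode == 2) return t - start_time;            /* elapsed seconds  */
    return t;                                        /* mode 3: absolute */
}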
3 Memory
The following measurements were taken to evaluate memory access rates on both Linux and Windows XP.
3.1 Cache Vs RAM Vs Swap Disk
3.1.1 Metric Description
This test was performed to measure the relative memory bandwidths of the cache, RAM and swap disk. In each run of the experiment a block of memory was allocated and looped over in a Write-Read-Write sequence. The first set of writes ensures a warm cache; the subsequent read and write loops were then timed to measure memory access times.
N = 1000;
si = kbsize*1024/sizeof(int);
data = (int*)malloc(kbsize*1024);

/* sequential write */
GetTime(1);
for(i=0;i<N;i++)
    for(j=0;j<si;j++) data[j] = j+i;
writetime1 = GetTime(2);

/* sequential read */
GetTime(1);
for(i=0;i<N;i++)
    for(j=0;j<si;j++) sum = data[j];
readtime = GetTime(2);

free(data);
Figure 1 : Code Segment for Memory Access
The loops for each type of access (write, read and write) were executed 1000 times for memory chunks ranging in size from 1KB to 512MB in steps of powers of 2. The expected result was a drop in bandwidth when the memory size exceeds the cache size (512KB) and a further drop beyond the RAM size (256MB).
3.1.2 Results and Analysis
The following graphs were obtained from runs on Linux and Windows.
[Graphs: 'Cache Vs RAM Vs Disk - Linux' and 'Cache Vs RAM Vs Disk - Windows'; Write 1, Read and Write 2 bandwidths (MB/sec) vs allocated memory (KB).]
Figure 2 : Graphs for Memory Access
These graphs clearly show bandwidth drops at the cache and RAM boundaries. The RAM bandwidth is approximately 4 times lower than the cache bandwidth for memory writes, and approximately 3 times lower for memory reads. These ratios are consistent between Linux and Windows.
The average memory bandwidths for the cache, RAM, and Swap Disk were calculated as follows:
Bandwidth (MB/sec) = (Memory Chunk Size)/ (Iteration Time over this chunk)
Memory bandwidths in MB/seconds were estimated as follows:
Operating System | Cache Write | Cache Read  | RAM Write | RAM Read    | Swap Disk Write | Swap Disk Read
Linux            | 2035.709    | 2888.2118   | 527.997   | 1264.58296  | 7.455528        | 12.0370933
Windows          | 2132.011    | 2749.24728  | 563.9692  | 1323.843481 | 2.670522        | 13.10448265
The page fault latency between the cache-RAM and RAM-Swap disk can be estimated from these numbers.
Cache–RAM Page Fault Latency = Average RAM Access Time – Average Cache Access Time
RAM–Swap Page Fault Latency = Average Swap Disk Access Time – Average RAM Access Time
Page Fault Latencies in seconds were estimated as follows using ‘read’ times:
Operating System | Cache-RAM                      | RAM-Swap Disk
Linux            | 3.088e-6 - 1.352e-6 = 1.736e-6 | 324.57e-6 - 3.088e-6 = 321.482e-6
Windows          | 2.950e-6 - 1.420e-6 = 1.53e-6  | 298.06e-6 - 2.950e-6 = 295.11e-6
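The per-access times in the table above appear to be per 4KB page, i.e. 4KB divided by the measured read bandwidth (for example, 4/(2888.2118*1024) s = 1.352e-6 s for the Linux cache). The sketch below is not part of the test code; it simply assumes this interpretation, together with the Linux read bandwidths from the earlier table, and reproduces the latency arithmetic.

/* Sketch: page-fault latency arithmetic from the measured Linux read
   bandwidths, assuming the per-access times are per 4KB page.         */
#include <stdio.h>

static double page_time(double bw_mb_per_sec)
{
    /* time to read one 4KB page = (4/1024 MB) / (bandwidth in MB/s) */
    return (4.0 / 1024.0) / bw_mb_per_sec;
}

int main(void)
{
    double t_cache = page_time(2888.2118);    /* ~1.352e-6 s  */
    double t_ram   = page_time(1264.58296);   /* ~3.088e-6 s  */
    double t_swap  = page_time(12.0370933);   /* ~324.6e-6 s  */

    printf("Cache-RAM page fault latency = %e s\n", t_ram  - t_cache);
    printf("RAM-Swap page fault latency  = %e s\n", t_swap - t_ram);
    return 0;
}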
3.2 Code/Memory Optimizations
3.2.1 Metric Description
This test was done to estimate the size of the code being executed, from observations of memory access times as a function of the size of the memory chunk (data) allocated and looped over by the code. The data size beyond which the memory bandwidth drops drastically indicates the point at which the code plus data size exceeds the cache size. Since the cache size and the size of the allocated data are known, this gives an approximation of the amount of cache used by all currently executing code (plus other cache-resident code and buffers).
In this section the same code as in 3.1 was used, with data sizes ranging from 4KB to 1020KB in steps of 8KB and finer sampling near the cache size. The expected result is a drop in bandwidth when the code plus data size exceeds the cache. This point was expected to move with the level of compiler optimization, as the code becomes tighter, so the code was run at different optimization levels (no optimization, O1-O5) to observe the effect.
Approximate Size of Code = Cache Size – Memory allocated at the point the bandwidth drops
This "code" size is only an estimate, as it includes the test code as well as any other cache-resident code. The allocated chunks of memory were therefore looped over 1000 times, to ensure (as far as possible) that only the code and data of the test process remained resident in the cache.
3.2.2 Results and Analysis
The following graphs were obtained from runs on Linux and Windows with compiler optimization level O4. The plots show a large drop in bandwidth between chunk sizes of 400KB and 550KB, which means that beyond roughly 400KB the resident code and the allocated memory contend for the cache.
[Graphs: 'Code Plus Data - O4 Optimization - Linux' and 'Code Plus Data - O4 Optimization - Windows'; Write 1, Read and Write 2 bandwidths (MB/sec) vs data size (KB).]
Figure 3 : Graph for Memory Access for Code plus Data
The approximate size of the code resident in cache is calculated as follows:
Operating System | Approximate Code Size
Linux            | 512 - 436 = 76 KB
Windows          | 512 - 476 = 36 KB
The plots for runs at lower optimization levels showed noisier behaviour within the cache size than those at higher levels of compiler optimization. The graphs for the runs with different optimization levels are shown in Section 7 with further analysis.
3.3 Looping and Sub-Looping
3.3.1 Metric Description
In this experiment we compared the write execution times of repeated looping over a large chunk of memory (8MB
and 32MB) to the times observed by splitting the chunk into numerous equal pieces and repeatedly looping over them
separately.
This measurement was an attempt to estimate an optimal piece size at which the execution time is the least. We
expected that at some piece size, the two contending factors of speed due to in-cache looping and overhead due to a
large number of such pieces, would balance out, giving minimum execution time.
Piece sizes ranged from 1KB to 8MB (and 32MB). The execution times while sub-looping were compared to the time for a piece size equal to the total chunk size (number of pieces = 1). This approach could be used to structure loops efficiently.
/*** Looping and Sub-looping ***/
N = 1000;
M = 8192;
si = kbsize*1024/sizeof(int);
data = (int *)malloc(M*1024);

/* multiple loops - sequential write */
temp = (int)(M/kbsize);
GetTime(1);
for(p=0;p<temp;p++)
{
    tp = p*si;
    for(i=0;i<N;i++)
        for(j=0;j<si;j++) data[tp+j] = j+i;
}
multiloop = GetTime(2);
free(data);
Figure 4 : Code Segment for Looping and Sub-Looping
3.3.2 Results and Analysis
The following graphs show the results for Linux and Windows.
[Graphs: 'Sub-Looping (8192 KB) - Linux', with and without optimization; time (sec) vs section size (KB).]
Figure 5 : Graphs for Sub-Looping with 8192KB on Linux
From these graphs we can see that the optimal piece size is around 64KB. Below 64KB the looping takes longer even though the pieces fit entirely in the cache, possibly because of the overhead of breaking the memory chunk into a large number of small pieces. For pieces larger than the cache size, the write time is approximately 3 times higher.
This test was done for memory chunk sizes of 8MB and 32MB, to see whether changing the total size of the problem would show different trends. No such difference was observed, which suggests that this method of determining the optimal piece size scales with the overall problem size.
[Graphs: 'Sub-Looping (8192 KB) - Windows' (two panels, with optimization); time (sec) vs section size (KB).]
Figure 6 : Graphs for Sub-Looping with 8192KB on Windows
An algorithm that performs a large number of repetitive operations on large chunks of data can use this kind of probing of system characteristics to structure its loops dynamically for performance, as the sketch below illustrates. This applies both to computations that perform repetitive operations on large data sets and to complex algorithms that can be broken into several algorithmically simpler sub-problems. Such algorithms can be classified as 'Cache Oblivious Algorithms' (Ref: [4]). FFTW and similar adaptive algorithms first probe a system to obtain information about its memory architecture and record these measurements; any subsequent run then uses this system information to restructure the block sizes/shapes in a block-decomposed implementation.
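As an illustration of this kind of loop restructuring, the sketch below (not taken from the report's test code; the 64KB piece size is only an assumption based on Figure 5) processes a large array in cache-sized pieces, repeating the work on each piece before moving on to the next.

/* Sketch of sub-looped (blocked) processing: the array is visited in
   fixed-size pieces so that each piece stays cache-resident across the
   repetitions.  BLOCK_INTS is an assumed tuning parameter (~64KB).      */
#include <stdlib.h>

#define BLOCK_INTS (64 * 1024 / sizeof(int))

void blocked_fill(int *data, size_t total_ints, int repetitions)
{
    size_t start, end, j;
    int i;

    for (start = 0; start < total_ints; start += BLOCK_INTS)
    {
        end = start + BLOCK_INTS;
        if (end > total_ints) end = total_ints;

        /* repeat the work on this piece while it is (likely) in cache */
        for (i = 0; i < repetitions; i++)
            for (j = start; j < end; j++)
                data[j] = (int)j + i;
    }
}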
4 Disk IO
The following measurements were taken to evaluate disk I/O performance on both Linux and Windows XP.
4.1 Disk access (read-write-read)
This measurement was performed to evaluate average disk access times for the system being tested, and to examine
the relative disk access times between reading (from the disk) and writing (to a file on this disk).
Three sets of measurements were made on a 512MB data file (RAM = 256MB), using 'fread-fwrite', 'read-write', and 'mmap-munmap', to compare their relative speeds given the different buffering schemes used by these functions and system calls.
The data was accessed from the file in chunks varying in size from 1KB up to the size of the file, to expose any effect that the chunk size, or the number of such chunk accesses, may have on disk I/O performance.
For each run, reading sequentially through the 512MB file ensures that reads always come from the disk, because of the LRU page replacement strategy of the OS. Writes behave according to the buffering scheme used.
The algorithm reads the contents of the 512MB file piece by piece, changes the data in memory, writes it back to the file in pieces, and then reads the pieces again from the file. The two reads were expected to show slightly different timings because of the locality inevitably created by the earlier reads and writes, which transfer the file pages into main memory (as clean or dirty pages). Asynchronous and synchronous access was controlled as described below.
4.1.1 fopen-fclose-fread-fwrite
4.1.1.1 Metric Description
Stream IO functions were used to test the 'usual' mode of file IO. 'fread' is always synchronous (up to the page read-ahead), while 'fwrite' uses a stream buffer (default size 8KB). The 'setvbuf' function was used to control this stream buffer and to turn it on and off.
/*** DISK - fread ***/
n = NN;                      /* size in KB of target file            */
nbuff = n/kbsize;
buf = atoi(argv[2]);         /* 1: disable stream buffering (_IONBF) */
data = (char*)malloc(kbsize*1024);

/* read the file */
fp = fopen("scratchfile","r");
if(buf == 1) setvbuf(fp,NULL,_IONBF,0);
GetTime(1);
for(i=0;i<nbuff;i++) fread(data,sizeof(char),kbsize*1024,fp);
readtime1 = GetTime(2);
fclose(fp);

for(i=0;i<kbsize*1024;i++) data[i] = i % 250;

/* write the file */
fp = fopen("scratchfile","r+");   /* "r+" so that fwrite can update the file */
if(buf == 1) setvbuf(fp,NULL,_IONBF,0);
GetTime(1);
for(i=0;i<nbuff;i++) fwrite(data,sizeof(char),kbsize*1024,fp);
writetime = GetTime(2);
fclose(fp);

/* read the file */
fp = fopen("scratchfile","r");
if(buf == 1) setvbuf(fp,NULL,_IONBF,0);
GetTime(1);
for(i=0;i<nbuff;i++) fread(data,sizeof(char),kbsize*1024,fp);
readtime2 = GetTime(2);
fclose(fp);
free(data);
Figure 7 : Code Segment for Disk Access using 'fread' and 'fwrite'
4.1.1.2 Results and Analysis
Data sets were recorded with and without (default) stream buffering. The results are as follows.
[Graphs: 'Disk Read With and Without Buffering - Linux' and 'Disk Write With and Without Buffering - Linux'; bandwidth (MB/sec) vs data size (KB).]
Figure 8 : Graphs for Disk Read And Write Access on Linux
The average disk access time per KB (the reciprocal of the average read bandwidth) is 45.142e-6 seconds for Linux and 36.558e-6 seconds for Windows.
For Linux, the performance of 'fread' was roughly constant with and without stream buffering, up to a user buffer size equal to the RAM size. This could be because the default stream buffer size is only 8KB. The drop beyond the RAM size is expected, since at that point the data is repeatedly swapped to disk.
The 'fwrite' performance increases with the user buffer size up to the size of the RAM; writes without buffering are only slightly slower.
[Graphs: 'Disk Read With and Without Buffering - Windows' and 'Disk Write With and Without Buffering - Windows'; bandwidth (MB/sec) vs data size (KB).]
Figure 9 : Graphs for Disk Read And Write Access on Windows
On Windows, there is a drastic difference in 'fread' performance with and without stream buffering. Only a small sample was taken on Windows without stream buffering, because each run took approximately 27 minutes.
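For reference, the stream buffer discussed in 4.1.1.1 can also be enlarged rather than disabled. The following minimal sketch (an illustration, not part of the report's tests; the 64KB size is an arbitrary choice) requests a fully buffered stream with a user-supplied buffer.

/* Sketch: request full buffering with a user-supplied 64KB stream
   buffer instead of the 8KB default or the unbuffered _IONBF mode.  */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("scratchfile", "r");
    char *streambuf = malloc(64 * 1024);

    /* setvbuf must be called before any read or write on the stream */
    if (fp && streambuf)
        setvbuf(fp, streambuf, _IOFBF, 64 * 1024);

    /* ... fread() loop as in Figure 7 ... */

    if (fp) fclose(fp);
    free(streambuf);
    return 0;
}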
4.1.2 open-close-read-write
4.1.2.1 Metric Description
System calls to manipulate files via file descriptors were used to test file IO without the overhead of stream buffering,
and with the option of forcing synchronous writes (via the O_SYNC flag in 'open'). These forced synchronous writes
are to ensure that the file pages are cleaned and written to disk.
Data sets were recorded with and without the O_SYNC flag, corresponding to synchronous and asynchronous IO
respectively.
/******** READ-WRITE *****/
n = NN;              /* size in KB of target file */
nbuff = n/kbsize;
data = (char*)malloc(kbsize*1024);

/* read the file */
if(buf == 1) fd = open("scratchfile",O_RDWR|O_SYNC);
else         fd = open("scratchfile",O_RDWR);
GetTime(1);
for(i=0;i<nbuff;i++) read(fd,data,kbsize*1024);
readtime1 = GetTime(2);
close(fd);

for(i=0;i<kbsize*1024;i++) data[i] = i % 250;

/* write the file */
if(buf == 1) fd = open("scratchfile",O_RDWR|O_SYNC);
else         fd = open("scratchfile",O_RDWR);
GetTime(1);
for(i=0;i<nbuff;i++) write(fd,data,kbsize*1024);
writetime = GetTime(2);
close(fd);

/* read the file */
if(buf == 1) fd = open("scratchfile",O_RDWR|O_SYNC);
else         fd = open("scratchfile",O_RDWR);
GetTime(1);
for(i=0;i<nbuff;i++) read(fd,data,kbsize*1024);
readtime2 = GetTime(2);
close(fd);
free(data);
Figure 10 : Code Segment for Disk Access using 'open' and 'close'
4.1.2.2 Results and Analysis
[Graphs: 'Disk Read - Synchronous Vs Asynchronous - Linux' (Read 1 and Read 2, synchronous and asynchronous) and 'Disk Write - Synchronous Vs Asynchronous - Linux' (Write sync and async); bandwidth (MB/sec) vs chunk size (KB).]
Figure 11 : Graphs for Disk Read And Write Access on Linux
For Linux, synchronous reads (Read 1 and Read 2) show similar trends, since the reads always come from the disk. With asynchronous access, the first read (Read 1) is also from the disk and hence follows the synchronous curve, but the second read (Read 2) appears much slower. This can be explained by the asynchronous writes not always reaching the disk, so the RAM fills up with dirty pages; the data is written to disk only when the dirty pages exceed the size of the RAM. A subsequent read must then first clean the pages in RAM by writing them to disk before it can read fresh data from the disk (which had been written out during the write phase under the LRU policy).
Synchronous writes show a lower bandwidth than asynchronous writes, since synchronous writes ensure that the data is written to disk rather than held in RAM as dirty pages.
The corresponding measurements for Windows show no such difference between synchronous and asynchronous disk access. This could be because the Cygwin environment on Windows may not correctly translate the O_SYNC flag passed to 'open'. The Windows reads do, however, show a drop in throughput at the cache size.
[Graphs: 'Disk Read - Synchronous Vs Asynchronous - Windows' and 'Disk Write - Synchronous Vs Asynchronous - Windows'; bandwidth (MB/sec) vs chunk size (KB).]
Figure 12 : Graphs for Disk Read And Write Access on Windows
4.1.3 mmap-msync-mmap
4.1.3.1 Metric Description
Kernel calls that manipulate virtual memory were used to measure disk access times without the data-copying overhead of 'fread' and 'read', which copy from system buffers into user buffers. Mapping chunks of memory does not load file pages; accessing the mapped region by dereferencing triggers loads from the disk into physical memory. 'msync' and 'munmap' were used to ensure that data written to the mapped areas is synchronously written to the disk.
/****** MMAP *******/
/* read the file */
fd = open("scratchfile",O_RDWR);
GetTime(1);
for(i=0;i<nbuff;i++)
{
    startaddress = mmap(0,kbsize*1024,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
    for(j=0;j<kbsize*1024;j++) data[j] = *((char*)startaddress + j);
    err = munmap(startaddress,kbsize*1024);
    lseek(fd,kbsize*1024,SEEK_CUR);
}
readtime1 = GetTime(2);
close(fd);

for(i=0;i<kbsize*1024;i++) data[i] = i % 250;

/* write the file */
fd = open("scratchfile",O_RDWR);
GetTime(1);
for(i=0;i<nbuff;i++)
{
    startaddress = mmap(0,kbsize*1024,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
    for(j=0;j<kbsize*1024;j++) *((char*)startaddress + j) = data[j];
    err = munmap(startaddress,kbsize*1024);
    lseek(fd,kbsize*1024,SEEK_CUR);
}
writetime = GetTime(2);
close(fd);

/* read the file */
fd = open("scratchfile",O_RDWR);
GetTime(1);
for(i=0;i<nbuff;i++)
{
    startaddress = mmap(0,kbsize*1024,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
    for(j=0;j<kbsize*1024;j++) data[j] = *((char*)startaddress + j);
    err = munmap(startaddress,kbsize*1024);
    lseek(fd,kbsize*1024,SEEK_CUR);
}
readtime2 = GetTime(2);
close(fd);
free(data);
Figure 13 : Code Segment for Disk Access using 'mmap' and 'munmap'
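The metric description mentions 'msync', although the code segment above shows only 'munmap'. A minimal sketch of flushing a mapping explicitly before unmapping it, following the argument usage in Figure 13 but otherwise an assumption rather than the report's exact code, is:

/* Sketch: flush the dirty pages of a mapping synchronously, then unmap. */
#include <stddef.h>
#include <sys/mman.h>

void flush_and_unmap(void *startaddress, size_t len)
{
    msync(startaddress, len, MS_SYNC);    /* block until pages reach disk */
    munmap(startaddress, len);            /* then drop the mapping        */
}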
4.1.3.2 Results and Analysis
The average disk access time per KB is approximately 2.579e-6 seconds.
Both reads and writes using ‘mmap/munmap’ always access the disk when dereferencing the data. This is why the
read and write curves show similar trends. There is a performance peak when the chunk sizes are just below the size
of the cache (512 KB). This could be due to the fact that the size of the user buffers into which the data is being read
is small enough to fit in the cache and large enough to amortize the overhead of having to read a larger number of
smaller chunks.
[Graph: 'Disk Access with MMAP'; Read 1, Write and Read 2 bandwidths (MB/sec) vs chunk size (KB).]
Figure 14 : Graph for Disk Access using 'mmap' and 'munmap'
There is a rise in performance at chunk sizes just larger than the cache, which dips slightly as the data size approaches the RAM size. This could be a similar effect to the hump observed just below the cache size, indicating a balance point between the user buffer size approaching that of the RAM and the advantage of keeping the buffer entirely in RAM.
These three measurements demonstrate the differences in cost incurred by different file access mechanisms. If file IO is a dominant feature of the application being designed, choosing an appropriate file access mechanism can yield a substantial benefit.
Note: On Windows XP we could only perform the tests with 'fopen-fclose-fread-fwrite' and 'open-close-read-write'; 'mmap' did not work, as it could not be supported via Cygwin.
4.2 Sequential Vs Random Reads
4.2.1 Chunk Sizes in powers of 2
4.2.1.1 Metric Description
This test was performed to compare times for sequential and random disk access (read), keeping in mind factors like
page read-ahead and physical proximity on the disk.
/******* SEQ_RAN *******/
/* time the random offset generation alone (overhead to be subtracted) */
GetTime(1);
for(i=0;i<nbuff;i++)
{
    off = (int)((double)abs(rand())/RAND_MAX*(nb-1)*kbsize*1024);
}
rant = GetTime(2);

/* read the file - random (seeks are NOT to page boundaries) */
fd = open("scratchfile",O_RDWR|O_SYNC);
GetTime(1);
for(i=0;i<nbuff;i++)
{
    off = (int)((double)abs(rand())/RAND_MAX*(nb-1)*kbsize*1024);
    lseek(fd,off,SEEK_SET);
    read(fd,data,kbsize*1024);
}
rantime = GetTime(2) - rant;
close(fd);

/* read the file - sequential */
fd = open("scratchfile",O_RDWR|O_SYNC);
off=0;
GetTime(1);
for(i=0;i<nbuff;i++)
{
    lseek(fd,0,SEEK_CUR);
    read(fd,data,kbsize*1024);
}
seqtime = GetTime(2);
close(fd);
free(data);
Figure 15 : Code Segment for Sequential and Random Disk Access
The fixed 512MB data file was used, and the data was read in chunks of different sizes as before (1KB up to the file size). This was to bring out the effects of the chunk size, versus the number of chunks read, on sequential and random access patterns. The number of chunks depended on the chunk size, with each sequential run reading through the entire 512MB data file; the same number of chunk accesses was then timed with random offsets. (This makes it a pseudo-random access pattern: although the beginnings of the chunks are chosen at random, the data within each chunk is read sequentially, and the same number of chunks is read sequentially and randomly.)
4.2.1.2 Results and Analysis
[Graphs: 'Data Read - Linux' and 'Data Read - Windows'; sequential and random access bandwidths (MB/sec) vs chunk size (KB).]
Figure 16 : Graphs for Sequential and Random Disk Access on Linux and Windows
As expected the overall trends showed better performance for sequential access compared to random access.
At chunk sizes smaller than the cache (512KB), both types of access loop within the cache, and the dominant cause of the performance difference should be the overhead of random disk access. However, the performance decreases with increasing chunk size, something not observed in the other file access tests, which did not sample the curve as finely.
For chunk sizes around the size of the cache (256KB to 1MB), however, sequential and random access show similar performance, with the minimum distance between the two curves occurring at the cache size of 512KB. This could be the result of two interacting factors: cache-versus-RAM memory access and sequential-versus-random disk access.
The sequential access shows an expected performance drop at the file cache size. One point to note is that at larger
chunk sizes, the number of chunks being read decreases, and this decreases the degree of randomness.
For chunk sizes beyond the size of the cache, the difference is mainly because of the two varying access patterns.
They do not differ as much as below the cache size, since this degree of randomness has also decreased. (Note: All
timings were for reading a fixed total amount of data from the disk – 512MB.)
Similar trends are observed for Windows. We have not yet been able to devise a plausible explanation for this trend.
4.2.2 Page Read Ahead
4.2.2.1 Metric Description
We ran another test, using chunk sizes ranging from 1KB to 128KB. This was an attempt to see at what chunk size
the trend changes, indicating the limit of the advantages of the page read-ahead.
4.2.2.2 Results and Analysis
For Linux, the sequential access performance is an order of magnitude higher than the random access performance. The random access bandwidths are also noisier than the sequential ones because the access pattern differs from run to run. Both, however, show a drop in bandwidth between chunk sizes of 60KB and 65KB.
[Graphs: 'Disk Read - Sequential - Linux' and 'Disk Read - Random - Linux'; bandwidth (MB/sec) vs chunk size (KB, 1-128).]
Figure 17 : Graphs for Sequential and Random Disk Access on Linux to determine Page Read Ahead Size
Below the cache size, the performance of sequential access is an order of magnitude higher than random, but beyond
approximately 64KB, the bandwidths for both sequential and random are at the same order of magnitude (except for
the noise due to the random access pattern). This suggests that no more performance gains are visible due to the
page read-ahead, and both sequential and random disk access have comparable performance.
[Graphs: 'Disk Read - Sequential - Windows' and 'Disk Read - Random - Windows'; bandwidth (MB/sec) vs chunk size (KB, 1-128).]
Figure 18 : Graph for Sequential and Random Disk Access on Windows to determine Page Read Ahead Size
This gives an estimate of the page read-ahead employed by the operating system: for a page size of 4KB, it corresponds to a read-ahead of 16 pages (64/4) on Linux. This would be the combined effect of page read-ahead buffering at different levels.
We saw no such trend for Windows, except that sequential access was an order of magnitude faster than random access and less noisy.
4.3 Read Vs mmap
4.3.1 Metric Description
This test compares the complete file access operation (read the file in chunks with 'read', change the data in each chunk in a loop, and write it back to disk synchronously with 'write') against the same operation implemented with 'mmap-munmap'. In both cases timings were obtained for the 'read', 'loop' and 'write' phases and compared.
/**** READ Vs MMAP ****/
/* Get timing for GetTime itself */
tread=0; tloop=0; twrite=0;
gtime = GetTime(3);
for(i=0;i<nbuff;i++)
{
    GetTime(1); tread  += GetTime(2);
    GetTime(1); tloop  += GetTime(2);
    GetTime(1); twrite += GetTime(2);
}
gtime = GetTime(3) - gtime;
printf("bufsize= %d KB nbuff= %d timers= %10.9lf",kbsize,nbuff,gtime);

/* Read and Write the File using read,write */
tread=0; tloop=0; twrite=0;
fd = open("scratchfile",O_RDWR|O_SYNC);
off=0;
readtime = GetTime(3);
for(i=0;i<nbuff;i++)
{
    GetTime(1);
    read(fd,data,kbsize*1024);
    tread += GetTime(2);

    GetTime(1);
    for(j=0;j<kbsize*1024;j++) data[j]=(data[j]+5)%250;
    tloop += GetTime(2);

    GetTime(1);
    lseek(fd,-1*kbsize*1024,SEEK_CUR);
    write(fd,data,kbsize*1024);
    twrite += GetTime(2);
}
readtime = GetTime(3)-readtime;
close(fd);

/* Read and Write the File using mmap,munmap */
tread=0; tloop=0; twrite=0;
fd = open("scratchfile",O_RDWR|O_SYNC);
mmaptime = GetTime(3);
for(i=0;i<nbuff;i++)
{
    GetTime(1);
    startaddress = mmap(0,kbsize*1024,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
    tread += GetTime(2);

    GetTime(1);
    for(j=0;j<kbsize*1024;j++)
        *((char*)(startaddress)+j) = ((*((char*)(startaddress)+j))+5)%250;
    tloop += GetTime(2);

    GetTime(1);
    err = munmap(startaddress,kbsize*1024);
    lseek(fd,kbsize*1024,SEEK_CUR);
    twrite += GetTime(2);
}
mmaptime = GetTime(3) - mmaptime;
close(fd);
free(data);
printf("mmaptotal= %10.9lf mprd= %10.9lf mlp= %10.9lf mwr= %10.9lf ",
       mmaptime,tread,tloop,twrite);
Figure 19 : Code Segment for 'read' Vs 'mmap'
4.3.2 Results and Analysis
[Graphs: 'Read Vs MMAP for All Buffer Sizes' and 'Read Vs MMAP for Small Buffer Sizes'; MMAP and synchronous read-write times (sec) vs chunk size (KB).]
Figure 20 : Graphs for Disk Access using 'read' and 'mmap'
As these graphs show, synchronous file access using 'mmap' is much faster than using the 'read' and 'write' system calls, because of the overhead of copying data into user buffers with 'read' and 'write'.
[Graphs: 'MMAP' and 'Read Write'; Read, Loop and Write times (sec) vs chunk size (KB).]
Figure 21 : Individual graphs for 'mmap' and normal 'read-write'
The above plots show the break-up of the read, loop, and write phases for the two access methods.
Note : The two plots have different scales, so for clarity, we have included the numbers in a table below.
The following analysis compares the ‘read’, ’loop’, and ‘write’ phases of the two implementations.
Reading with 'read' is slower because of the extra copy into the user buffer. 'Reading' with 'mmap' is much faster since only page table entries are modified to include the newly mapped region of virtual memory; no pages are loaded from the disk. The 'read' timings for chunk sizes smaller than the cache are much lower than those for larger chunk sizes.
Looping with 'read' is faster because the array being looped over is already in memory. Looping with 'mmap' is much slower because this is when the pages are loaded into physical memory, read and modified ('mmap' loads pages only when the corresponding addresses are dereferenced).
Writing with 'write' is slower because it copies from the user buffer to the system buffer and then to disk (forced synchronous with O_SYNC). Writing with 'munmap' is faster because the dirty pages are written to disk without any extra buffer copies.
Buffer Size (KB) | MMAP Total (sec) | MMAP Read (sec) | MMAP Loop (sec) | MMAP Write (sec) | RW Total (sec) | RW Read (sec) | RW Loop (sec) | RW Write (sec)
1     | 26.487 | 1.340 | 21.289 | 2.257 | 519.436 | 2.429  | 9.949 | 504.905
2     | 23.216 | 0.682 | 20.569 | 1.138 | 280.510 | 1.241  | 9.690 | 268.498
4     | 20.890 | 0.359 | 19.559 | 0.567 | 95.227  | 1.030  | 9.676 | 83.976
8     | 20.080 | 0.180 | 19.397 | 0.291 | 77.656  | 0.912  | 9.645 | 66.823
16    | 19.713 | 0.091 | 19.360 | 0.148 | 67.543  | 0.853  | 9.671 | 56.871
32    | 19.543 | 0.042 | 19.352 | 0.089 | 65.375  | 0.893  | 9.672 | 54.735
64    | 19.466 | 0.022 | 19.362 | 0.047 | 65.755  | 0.971  | 9.609 | 55.134
128   | 19.447 | 0.013 | 19.380 | 0.030 | 90.765  | 1.288  | 9.720 | 79.725
256   | 19.604 | 0.009 | 19.561 | 0.026 | 77.520  | 1.802  | 9.763 | 65.935
512   | 19.992 | 0.011 | 19.928 | 0.048 | 67.992  | 2.339  | 9.626 | 56.015
1024  | 20.086 | 0.007 | 20.012 | 0.064 | 57.188  | 30.655 | 9.849 | 16.678
2048  | 20.147 | 0.004 | 20.063 | 0.078 | 63.309  | 31.656 | 9.780 | 21.870
4096  | 20.255 | 0.002 | 20.175 | 0.076 | 59.698  | 25.563 | 9.776 | 24.357
8192  | 20.787 | 0.394 | 20.325 | 0.066 | 60.346  | 24.324 | 9.771 | 26.251
16384 | 20.554 | 0.008 | 20.481 | 0.065 | 62.089  | 23.913 | 9.763 | 28.412
32768 | 21.191 | 0.404 | 20.723 | 0.064 | 61.896  | 23.766 | 9.833 | 28.297
65536 | 21.322 | 0.000 | 21.253 | 0.069 | 60.677  | 22.124 | 9.775 | 28.778
Figure 22 : Table for read and mmap timings
4.4 Remote File Access
4.4.1 Metric Description
This test estimates the overhead of accessing a file on a remote disk (an NFS mount) compared to accessing a file on a locally mounted disk (/tmp). It was performed only on Linux, on a different machine connected to a network with remotely mounted disks.
A file was created by performing write operations in different sized chunks. Different test runs were performed with
this file being created on a local disk (/tmp) and on a remotely mounted disk. These writes use the ‘write’ system call
with the O_SYNC flag set to force synchronous writes for each instance of the user buffer.
/******* SEQ_RAN *******/
/* time the random offset generation alone (overhead to be subtracted) */
GetTime(1);
for(i=0;i<nbuff;i++)
{
    off = (int)((double)abs(rand())/RAND_MAX*(nb-1)*kbsize*1024);
}
rant = GetTime(2);

/* read the file - random (seeks are NOT to page boundaries) */
fd = open("scratchfile",O_RDWR|O_SYNC);
GetTime(1);
for(i=0;i<nbuff;i++)
{
    off = (int)((double)abs(rand())/RAND_MAX*(nb-1)*kbsize*1024);
    lseek(fd,off,SEEK_SET);
    read(fd,data,kbsize*1024);
}
rantime = GetTime(2) - rant;
close(fd);

/* read the file - sequential */
fd = open("scratchfile",O_RDWR|O_SYNC);
off=0;
GetTime(1);
for(i=0;i<nbuff;i++)
{
    lseek(fd,0,SEEK_CUR);
    read(fd,data,kbsize*1024);
}
seqtime = GetTime(2);
close(fd);
free(data);
Figure 23 : Code Segment for Local and Remote Disk Access
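The code segment above is the sequential/random read fragment; the synchronous write loop used to create the file, as described in 4.4.1, is not listed. A minimal sketch of such a loop (the path, fill pattern and file permissions are assumptions, not the report's exact code) is:

/* Sketch: create a file by writing 'nbuff' chunks of 'kbsize' KB each,
   with O_SYNC forcing every write() out to the (local or NFS) disk.    */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void create_file_sync(const char *path, int kbsize, int nbuff)
{
    char *data = (char*)malloc(kbsize*1024);
    int   fd, i;

    memset(data, 'x', kbsize*1024);                    /* dummy contents    */
    fd = open(path, O_WRONLY|O_CREAT|O_SYNC, 0644);
    for (i = 0; i < nbuff; i++)
        write(fd, data, kbsize*1024);                  /* synchronous write */
    close(fd);
    free(data);
}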
4.4.2 Results and Analysis
[Graphs: 'File Access - Local - Linux' and 'File Access - Remote - Linux'; bandwidth (MB/sec) vs access unit size (bytes).]
Figure 24 : Graphs for Local and Remote Disk Access
The overall file access bandwidth for the remote disk is two orders of magnitude lower than for the local disk, because of the network overhead.
In both cases the bandwidth increases rapidly with chunk size, which indicates that fewer separate (synchronous) write requests are issued.
The data transfer time for file creation with a chunk size of 64 bytes, estimated from the remote access time, is 63.52 milliseconds. The corresponding transfer time calculated from 'ping' statistics, which includes the overhead of the network protocol layers, is 0.207 milliseconds. These measurements were made over a 100Mbps Ethernet link.
5 Processes
The following tests were run on Linux and Windows to measure the absolute times to perform system calls, as well as
the relative costs incurred between executing a function in different contexts (within the same process vs forking a
new process).
5.1 System Calls
5.1.1 Metric Description
The following functions were called repeatedly and the number of calls per second was calculated, giving an approximate absolute value for the overhead of each operation. The timings include the looping overhead required to perform the repetitions.
'getpid()' being the simplest available kernel call, was used to give an approximate overhead of a kernel function call.
'malloc-free' were used to allocate/deallocate 4 bytes of memory repeatedly to calculate approximate overhead of
memory allocation.
'file open-close' were used to measure the time overhead involved in the process of opening and closing an existing
file.
'file create-delete' overhead was measured by performing the test of file open-close, but always creating a new file
and deleting it after closing.
'fork' was used to give an approximate overhead of process creation.
/*** Process - getpid, open/close, malloc/free, fork ***/
/* Call getpid() */
GetTime(1);
for(i=0;i<N/100;i++) getpid();
difftime = GetTime(2);

/* File Open Close - fopen-fclose */
GetTime(1);
for(i=0;i<N/100;i++)
{
    fp=fopen("temp","w");
    fclose(fp);
}
difftime = GetTime(2);

/* File Open Close - open-close */
GetTime(1);
for(i=0;i<N/100;i++)
{
    fd=open("temp",O_RDWR|O_SYNC|O_CREAT);
    close(fd);
}
difftime = GetTime(2);

/* File Create Delete */
GetTime(1);
for(i=0;i<N/100;i++)
{
    fd=open("temp",O_RDWR|O_SYNC|O_CREAT);
    close(fd);
    unlink("temp");
}
difftime = GetTime(2);

/* mem malloc free */
GetTime(1);
for(i=0;i<N/10;i++)
{
    data=(int*)malloc(4);
    free(data);
}
difftime = GetTime(2);

/* forking */
GetTime(1);
for(i=0;i<N/1000;i++)
{
    ferr=fork();
    if(ferr==0) exit(0);
    else wait(&stat);
}
difftime = GetTime(2);
Figure 25 : Code Segment for System Calls
5.1.2 Results
All values are in million calls per second.

Call                           | Linux (With Opt.) | Linux (Without Opt.) | Windows (With Opt.) | Windows (Without Opt.)
Getpid                         | 1.486661667       | 1.466027333          | 125                 | 42.65089633
File Open-Close (fopen-fclose) | 0.094614333       | 0.073689667          | 0.004799            | 0.057701
File Open-Close (open-close)   | 0.389528          | 0.387609             | 0.007438            | 0.261525
File Create-Delete             | 0.055121333       | 0.04593              | 0.002204            | 0.034418444
Malloc-Free                    | 11.692463         | 9.627664667          | 1.297353            | 7.539160222
Process Fork                   | 0.003328667       | 0.00332              | 0.000101            | 0.002249889
Figure 26 : Table of performance of system calls
5.2 Function performance with different implementation styles.
5.2.1 Metric Description
This test involved timing different implementations of functionally identical code. The code consists of an inner and
outer ‘for’ loop, which has been implemented as inline code, repeated function calls, recursion, and forking child
processes that executed the inner loop. The different implementations were proved to be functionally identical, by
calculating a sum as a by-product of the nested looping, and making sure that all implementations returned identical
sums.
We tried to ensure that the compiler optimization does not override our attempts to measure the time differences, by
ensuring reuse of variables that get changed inside the loops. Our goal here was not to time the actual function
invocation or forking times, but to measure the relative execution times for different implementations of the same
functionality.
(i) Inline
A simple pair of nested loops in main(), with sums calculated within the inner loop and accumulated in the outer loop. (The malloc is redundant in all these tests since the malloc-free operations are always performed within the same scope.)
(ii) Function Call
The outer loop was in the main() while the rest of the computation was performed as a function call. The overhead
incurred here is the cost of 'ntimes' function calls.
(iii) Recursion
The outer loop corresponds to stepping down through the recursion, and the inner loop was implemented on the
unwinding. This corresponds to a recursion that was 'ntimes' levels deep.
(iv) Forking
The outer loop was in main() while the inner loop computation was done in a child process. The child process performed only the inner loop computation, sent a return value (the partial sum) to the parent by writing an integer into a system pipe, and terminated using exit(0). The parent waited for the child to terminate; after executing 'wait(&stat)' it read the integer from the pipe and continued the outer loop. The 'wait()' was required to ensure that the process table entries for the child processes were removed and the children did not remain as zombie processes.
/*** Process - inline vs fn vs recursion vs fork ***/

/* Inline */
tsum=0; sum=0;
GetTime(1);
for(i=0;i<ntimes;i++)
{
    sum=0;
    data = (int*)malloc(sizeof(int)*size);
    for(j=0;j<size;j++) data[j] = j;
    for(j=0;j<size;j++)
    {
        data[j] = data[j] + i+j;
        sum += data[j];
    }
    free(data);
    tsum += sum;
}
inlinetime = GetTime(2);

/* Function calls */
tsum=0; sum=0;
GetTime(1);
for(i=0;i<ntimes;i++) tsum += testfn(i);
fntime = GetTime(2);

/* Recursive Calls */
tsum=0; sum=0;
GetTime(1);
tsum = testrec(ntimes-1);
rectime = GetTime(2);

/* Forking ! */
err = pipe(fildes);
tsum=0; sum=0;
GetTime(1);
for(i=0;i<ntimes;i++)
{
    ferr=fork();
    if(ferr<0) printf("err = %d \n",ferr);
    if(ferr==0)
    {
        sum=0;
        data = (int*)malloc(sizeof(int)*size);
        for(j=0;j<size;j++) data[j] = j;
        for(j=0;j<size;j++)
        {
            data[j] = data[j] + i+j;
            sum += data[j];
        }
        free(data);
        write(fildes[1],&sum,sizeof(int));
        exit(0);
    }
    else
    {
        wait(&stat);
        read(fildes[0],&sum,sizeof(int));
        tsum += sum;
    }
}
forktime = GetTime(2);

int testfn(int i)
{
    int sum=0, j;
    int *data=NULL;
    data = (int*)malloc(sizeof(int)*size);
    for(j=0;j<size;j++) data[j] = j;
    for(j=0;j<size;j++)
    {
        data[j] = data[j] + i+j;
        sum += data[j];
    }
    free(data);
    return(sum);
}

int testrec(int i)
{
    int *data=NULL;
    int j, tsum=0, sum=0;
    if(i>0) tsum = testrec(i-1);
    data = (int*)malloc(sizeof(int)*size);
    for(j=0;j<size;j++) data[j] = j;
    for(j=0;j<size;j++)
    {
        data[j] = data[j] + i+j;
        sum += data[j];
    }
    free(data);
    tsum += sum;
    return(tsum);
}
Figure 27 : Code Segment for different call types
5.2.2 Results and Analysis
The following graphs show the results for Linux and Windows. The plots show the execution times of the different implementations as a function of the number of iterations / function calls / recursion levels / processes forked.
[Graphs: 'Inline Vs Function Vs Recursion Without Optimization - Linux' and 'Inline Vs Function Vs Recursion With Optimization - Linux'; time (sec) vs number of iterations.]
Figure 28 : Graphs for Inline Vs Function Vs Recursion calls on Linux
All three implementation schemes scale linearly with the number of repetitions. The larger slope of the line for
recursion indicates it does not scale as well as the inline and function implementations.
An interesting observation is that without compiler optimization the inline implementation was faster than function
calls. But with compiler optimization (O3) the function call implementation ran faster than the optimized inline code!
[Graph: 'Inline Vs Function Vs Recursion With Optimization - Windows'; time (sec) vs number of iterations.]
Figure 29 : Graph for Inline Vs Function Vs Recursion on Windows
No such linear trends were observed for the different implementations on Windows beyond a certain number of
repetitions. This could be attributed to the ‘cygwin’ environment on which the tests were run.
[Graphs: 'Fork Calls - Linux' (with and without optimization) and 'Fork Calls - Windows' (with optimization); time (sec) vs number of iterations.]
Figure 30 : Graphs for Forks
For Linux, the fork implementation showed no difference with and without optimization. For both Linux and Windows,
the performance scales linearly.
6 Arithmetic Operations
This test measures the number of integer or float operations that can be done in a second. Another test compares the
execution time of integer operations with and without optimizations.
6.1 Integer and Float Operations
6.1.1 Metric Description
Integer and float arithmetic operations of addition, subtraction, multiplication and division were tested in this
experiment.
Each operation was executed in a “for” loop with the operation being performed once per iteration. N iterations of
these loops were timed. These timings include the looping overhead (mainly an increment and a compare operation),
which is common to all the tests, and hence can be ignored if only a pure comparison is being made. O4 compiler
optimization was used to ensure that while the iterations executed, all values stayed in the registers. This would
isolate the test from the overheads of memory accesses.
The loops were structured as shown below, with all variables declared as integers (or floats). N is always declared as
an integer. Several runs were taken and the average execution time was used to calculate the approximate number of
iterations executed per second.
/*** Sample Arithmetic Operation Test. ***/
N=100000000;
/* Integer Addition */
GetTime(1);
for(i=0;i<N;i++)
acc = i + inc;
difftime = GetTime(2);
printf("Int Add %lf MOps/sec\n",N/(difftime*1e+6));
Figure 31 : Code Segment for Arithmetic Operations
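The averaging over several runs described above can be sketched as follows. This is a minimal stand-alone illustration, not the project's harness: the gettimeofday-based timer, the NRUNS constant and the volatile accumulator (used here only to keep the loop from being optimized away, unlike the original register-resident setup) are assumptions.

/*** Sketch: average several timed runs of the integer-addition loop ***/
#include <stdio.h>
#include <sys/time.h>

#define NRUNS 5   /* hypothetical number of repeated runs */

static struct timeval t0;

/* Stand-in timer: mode 1 starts it, mode 2 returns elapsed seconds. */
double GetTime(int mode)
{
    struct timeval t1;
    if (mode == 1) { gettimeofday(&t0, NULL); return 0.0; }
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}

int main(void)
{
    long N = 100000000L, i, inc = 1;
    volatile long acc = 0;   /* volatile so the optimizer does not delete the loop */
    double total = 0.0;
    int r;

    for (r = 0; r < NRUNS; r++) {
        GetTime(1);
        for (i = 0; i < N; i++)
            acc = i + inc;
        total += GetTime(2);   /* accumulate elapsed time of this run */
    }
    printf("Int Add %lf MOps/sec\n", N / ((total / NRUNS) * 1e+6));
    return 0;
}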
6.1.2 Results and Analysis
[Bar chart: Arithmetic Operations - Linux Vs Windows. Y-axis: Millions of Operations/second; categories: Integer Addition, Integer Subtraction, Integer Multiplication, Integer Division, Float Addition, Float Subtraction, Float Multiplication, Float Division; series: Linux With/Without Optimizations, Windows With/Without Optimizations.]
Figure 32 : Graphs for Arithmetic Operations
These numbers show us that overall, integer operations are faster than float operations. Operations speed up with
compiler optimization, with the exception of integer division. Windows and Linux gave comparable numbers for all
these tests.
7 Linux Vs Windows
In this section, the performance of Linux and Windows is compared for the same measurements of memory access,
disk access and processes. Please note that the Windows measurements were done in the 'cygwin' environment, which
may account for some unexpected results. The same compiler was used for compiling the programs for the tests. One
thing that can be noted for both memory and disk performance is that Windows seems to use the cache more effectively
than Linux.
7.1 Memory
Memory – Cache Vs RAM Vs Swap Disk
As expected, there is a considerable drop in bandwidth for both memory reads and writes, on both Linux and Windows,
when the size of the allocated memory approaches the cache size. For both reads and writes, the Windows bandwidth
drops much earlier than the Linux bandwidth.
[Plots: Memory Write Access and Memory Read Access - Cache Vs RAM Vs SWAP. X-axis: Allocated Memory KB (1 to 65536); Y-axis: MB/sec; series: Linux and Windows Read and Write2 bandwidths.]
Figure 33 : Graphs comparing Linux and Windows memory access bandwidths
Code Plus Data – Different Levels of Optimizations
Effects of Code Optimization on Cache Vs RAM memory access times
In our test for comparing memory access times between the cache, RAM and the disk, and for estimating the cache-resident
code size, different levels of compiler optimization showed different trends, as shown in the following plots.
[Plots: Code Plus Data - Read and Write on Linux. X-axis: Data Size KB (4 to 1012); Y-axis: MB/sec; series: optimization levels O0 through O5.]
Figure 34 : Graphs for Code plus data with different levels of optimization on Linux
[Plots: Code Plus Data - Write and Read on Windows. X-axis: Data Size KB; Y-axis: MB/sec; series: optimization levels O0 through O5.]
Figure 35 : Graphs for Code plus data with different levels of optimization on Windows
From these plots it can be seen that without compiler optimization there is no apparent difference in performance
between the cache and the RAM; with any level of optimization, this divide is apparent. At lower levels of
optimization the performance curves are noisier than at higher levels. This indicates that the code gets more compact
with higher levels of optimization, resulting in less contention for the cache and hence smoother trends.
[Plots: Code Plus Data - Read and Write, Linux Vs Windows with O3 Optimization. X-axis: Data Size KB; Y-axis: MB/sec; series: Linux, Windows.]
Figure 36 : Graphs comparing Linux and Windows with O3 optimization level
The same compiler was used on both Linux and Windows. A rather consistent feature is that the Windows curves were
smoother around the cache size and, unlike the Linux curves, did not start decreasing much before the cache size.
Looping and Sub-Looping
[Plots: Sub-Looping (8192KB) and Sub-Looping (32768KB). X-axis: Section Size (KB); Y-axis: Time (sec); series: Linux, Windows.]
Figure 37 : Graphs comparing the Sub-Looping between Linux and Windows
These graphs show that the performance of Linux and Windows is comparable for this experiment. The main difference
is that the performance of Linux degrades before the chunk size reaches the cache size. This could be due to the
presence of other resident code in the Linux cache.
7.2 Disk IO
open-close-read-write
The performance of synchronous and asynchronous reads differs between Linux and Windows.
For synchronous reads, Linux performs better than Windows: the Linux bandwidth drops well beyond the cache size,
while the Windows bandwidth drops at that point.
For asynchronous reads, Windows performs better than Linux: the Linux bandwidth drops considerably before the cache
size, while the Windows bandwidth drops around that point. Overall, the reads using synchronous I/O show better performance.
[Plots: Disk Access Read - Synchronous and Disk Access Read - Asynchronous. X-axis: Access Unit Size KB; Y-axis: MB/sec; series: Linux, Windows.]
Figure 38 : Graphs comparing disk read bandwidths for Linux and Windows
[Plots: Disk Access Write - Synchronous and Disk Access Write - Asynchronous. X-axis: Access Unit Size KB; Y-axis: MB/sec; series: Linux, Windows.]
Figure 39 : Graphs comparing disk write bandwidths for Linux and Windows
The performance of synchronous and asynchronous writes is comparable for Linux and Windows.
The bandwidth for synchronous writes drops for Windows at the cache size, while that for Linux drops before the cache
size. Windows performs better than Linux for synchronous writes at every size; beyond the cache size, the Windows
performance continues to improve while that of Linux goes down.
The performance of asynchronous writes for Linux and Windows is comparable, except for a drop in the Windows
bandwidth at the cache size.
Sequential Vs Random
The sequential access performance of Linux is better than that of Windows for chunk sizes less than 64 KB; beyond
this size the sequential access bandwidths of Linux and Windows are comparable.
Random access performance is broadly similar on Linux and Windows across chunk sizes: Linux performs better than
Windows for chunk sizes less than 64 KB, and Windows performs better for chunk sizes greater than 64 KB.
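The two access patterns compared here can be reproduced at the user level with read() and lseek(). The following is a rough sketch under assumed parameters (a hypothetical pre-existing file 'testfile', 64 KB chunks, 64 MB total), not the harness used for the reported numbers.

/*** Sketch: sequential vs random reads of fixed-size chunks ***/
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    size_t chunk = 64 * 1024;            /* 64 KB chunks (assumed) */
    int nchunks = 1024;                  /* 64 MB read in total (assumed) */
    char *buf = malloc(chunk);
    int fd = open("testfile", O_RDONLY); /* file assumed to exist and be large enough */
    double t;
    int i;

    /* Sequential: read chunks back to back */
    t = now();
    for (i = 0; i < nchunks; i++)
        read(fd, buf, chunk);
    printf("sequential: %8.2f MB/sec\n",
           (chunk * (double)nchunks) / ((now() - t) * 1024 * 1024));

    /* Random: seek to a randomly chosen chunk before each read */
    t = now();
    for (i = 0; i < nchunks; i++) {
        lseek(fd, (off_t)(rand() % nchunks) * chunk, SEEK_SET);
        read(fd, buf, chunk);
    }
    printf("random:     %8.2f MB/sec\n",
           (chunk * (double)nchunks) / ((now() - t) * 1024 * 1024));

    close(fd);
    free(buf);
    return 0;
}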
[Plots: Disk Access - Sequential and Disk Access - Random. X-axis: Chunk Size KB; Y-axis: MB/sec; series: Linux, Windows.]
Figure 40 : Graphs comparing Sequential and Random disk access between Linux and Windows
Sequential and Random For Page Read Ahead
The break for page read-ahead is observed only on Linux and not on Windows. This could be an indication that the
page read-ahead limit is specific to each operating system.
[Plots: Disk Read - Sequential and Disk Read - Random. X-axis: Access Unit Size KB; Y-axis: MB/sec; series: Linux, Windows.]
Figure 41 : Graphs for comparing the disk accesses for Page Read Ahead between Linux and Windows
7.3 Process
Inline Vs Function Vs Recursion
[Plots: Inline Vs Function Vs Recursion - With Optimization, and Fork - With Optimization. X-axis: Iterations; Y-axis: Time (sec); series: Linux and Windows for each call type.]
Figure 42 : Graphs comparing the times for different call types on Linux and Windows
For the different call types, Linux performs better than Windows. This can be attributed partly to the 'cygwin'
environment, whose fork implementation is known to be suboptimal.
8 Conclusion
This project has been an exercise in designing metrics to test certain features of an operating system. We have
probed some of the mechanisms that control memory access, disk I/O and process management, and have tried to
interpret our observations based on what we know about the corresponding operating system algorithms. We have
also seen differences in behaviour when comparing performance with and without compiler optimization, and have
shown that in some cases a careful choice of compiler optimization can result in better performance. All these tests
were run at the user level, and we have tried to correlate our observations with documented details of the operating
systems to see how much of their behaviour is actually visible to a user program.
Appendix 1
While working on this project we came across a few interesting/surprising things that we did not know about earlier.
• Inline implementations of code run faster than implementations using function calls when compiled without
optimization. With compiler optimization, however, function-call implementations can sometimes perform better
than inline implementations.
• In simple 'for' loops with integer loop indices, the data type of the limiting variable is a significant factor in the
performance of the loop. Also, higher levels of compiler optimization are sometimes not as reliable as one would
expect: while using a global variable with a high level of compiler optimization, the global variable was updated in
the code but did not get updated in memory, and gave numerically wrong results.
• RedHat Linux version 8.0 does not have an explicit file cache. It uses the RAM as a file cache and stores data as
dirty pages, writing to disk only periodically, or when a process requests more memory than is currently available
as clean pages in the RAM.
• On Linux, a cron job triggers 'updatedb' every few hours. This is a version of 'locate' which organizes file
metadata and rearranges files on disk to maximize file access speeds and minimize fragmentation.
• If a process does a large number of repeated disk I/O system calls without any buffering (reading byte by byte),
it can get trapped in kernel mode simply waiting for hardware I/O, which sends the process into a 'disk sleep'
state. This could be related to the 'standby spindown timeout' setting available via 'hdparm'. Once a process
entered this state, it took extremely long to get out of it; even a small amount of buffering was enough to avoid it.
• Zombie processes are created if forked child processes terminate and are not 'waited' for by the parent. We ran
into this when we forked processes, terminated them explicitly, and did not have a 'wait' in the parent; it also
showed us that the maximum number of processes that can be forked is approximately 1200. A minimal sketch of
reaping children with wait() is given after this list.
• Working with the Cygwin UNIX environment on Windows was the first time we probed the system issues of a
virtual machine implementation. From the documentation on the Cygwin web site, we learnt the basic ideas behind
how Cygwin emulates a UNIX shell on top of Windows, and which features of Linux have been ported efficiently
and which have not. Cygwin emulates a kernel by creating a shared address space for all its processes, and
maintains file and process information there the way Linux does. Process creation uses part of this shared space,
with metadata stored to emulate the file system. Forking is expensive and does not use copy-on-write, with a large
number of explicit parent-child context switches. Regarding performance, processes are said to run slower on
Cygwin than on Linux, but it is not clear whether this can be attributed to the inefficiencies of Cygwin or to the
differences between the Unix and Windows operating systems.
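As referenced in the zombie-process note above, the following minimal sketch (not the project's test code) shows how reaping forked children with wait() releases their process-table entries; the number of children and the trivial child work are illustrative assumptions.

/*** Sketch: reaping forked children to avoid zombies ***/
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int nchildren = 8, i, status;   /* hypothetical child count */

    for (i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0)       /* child: do some trivial work and exit */
            exit(i);
    }

    /* Parent: without these wait() calls the exited children would remain
     * as zombie (defunct) entries in the process table until the parent
     * itself terminates. */
    for (i = 0; i < nchildren; i++)
        wait(&status);

    printf("all %d children reaped\n", nchildren);
    return 0;
}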
Some additional performance tests that we could think of but have not implemented are as follows.
• A test to compare context switch times and process execution times with sets of processes having different
controlled priorities. This could be implemented by forking processes, forcing them to multiplex by performing
blocking communication with each other via a pipe, and then observing what happens when their priorities are
changed and controlled using system calls like 'nice'.
• A test that uses shared memory with multiple processes accessing overlapping regions. If designed properly, this
could be used to probe the system and observe how shared memory is handled via locks and monitors.
• Multithreaded implementations of a process could be compared to the other implementation schemes, as part of
the comparison between inline, function-call, recursive and forked implementations of a simple nested computation.
• A test to measure or observe the effects of interrupts.