Profile-Driven Selective Program Loading
Tugrul Ince
tugrul@cs.umd.edu
Jeff Hollingsworth
Department of Computer Science
University of Maryland, College Park, MD 20742
University of Maryland
Motivation

Programs are getting larger!
– Many frameworks and libraries

Many supercomputers lack demand-paging
– Example: Cray XT and BlueGene series
– Available memory is scarce

Observation: Most programs do not use
every available function!
– Frameworks and libraries are too general
– Code that handles errors or special cases

Why not remove functions that are not
used in the common case?
Aim
Reduce memory footprint
by selectively loading
parts of shared libraries
Target Platforms and Applications

Unix/Linux systems that support ELF
– Modifies ELF program headers

Applications with many libraries
– Most current applications of reasonable size

Parallel programs running on multiple
nodes
– MPI etc.

Platforms without demand-paging
– Cray XT and BlueGene series
Architecture Overview


Application is profiled.
It is rewritten with
– Modified Shared Libraries
– A Signal Handler

Application is executed as usual.
Profiler

Need a list of never-called functions in
each shared library
– Profile the application several times
– May not be perfect

DynInst-based profiler
– Write small program (~ 70 LOC)
– Rewrite shared libraries
– Profile as many times as necessary
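
The list of never-called functions can be thought of as a set difference over several profile runs. The sketch below is illustrative only and does not use DynInst; the library name, the function names and the `never_called` helper are hypothetical.

```python
def never_called(all_functions, profile_runs):
    """Return, per library, the functions that no profile run ever called."""
    called = {}
    for run in profile_runs:
        for lib, funcs in run.items():
            # Union the call sets of all runs, library by library.
            called.setdefault(lib, set()).update(funcs)
    return {lib: set(funcs) - called.get(lib, set())
            for lib, funcs in all_functions.items()}


# Hypothetical library with four functions and two profile runs.
all_functions = {"libexample.so": {"init", "solve", "report_error", "debug_dump"}}
runs = [
    {"libexample.so": {"init", "solve"}},
    {"libexample.so": {"init", "report_error"}},
]
unused = never_called(all_functions, runs)
print(sorted(unused["libexample.so"]))
```

Adding more runs can only shrink the unused set, which is why profiling "as many times as necessary" makes the list more reliable.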
Rewriting

Do not load unused functions in .text
– Modify ELF program headers
– Example: libpetsc.so

Program Headers:
  Type       Offset    VirtAddr    PhysAddr    FileSiz   MemSiz    Flg  Align
  LOAD       0x000000  0x00000000  0x00000000  0x124584  0x124584  R E  0x1000
  LOAD       0x124584  0x00125584  0x00125584  0x013f8   0x0a434   RW   0x1000
  DYNAMIC    0x12459c  0x0012559c  0x0012559c  0x00130   0x00130   RW   0x4
  GNU_STACK  0x000000  0x00000000  0x00000000  0x00000   0x00000   RW   0x4

First Loadable Section: .text, .init, .fini, .plt
Second Loadable Section: .dynamic, .got, .got.plt, .data, .bss
Rewriting

Do not load unused functions in .text
– Modify ELF program headers
– Example: libpetsc.so

Program Headers:
  Type       Offset    VirtAddr    PhysAddr    FileSiz   MemSiz    Flg  Align
  LOAD       0x000000  0x00000000  0x00000000  0x090000  0x090000  R E  0x1000
  LOAD       0x112000  0x00112000  0x00112000  0x012584  0x012584  R E  0x1000
  LOAD       0x124584  0x00125584  0x00125584  0x013f8   0x0a434   RW   0x1000
  DYNAMIC    0x12459c  0x0012559c  0x0012559c  0x00130   0x00130   RW   0x4
  GNU_STACK  0x000000  0x00000000  0x00000000  0x00000   0x00000   RW   0x4

First Loadable Section: .text, .init, .fini, .plt
Second Loadable Section: .dynamic, .got, .got.plt, .data, .bss
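
The header change in the libpetsc.so example amounts to splitting one LOAD entry into two around a page-aligned hole. The following is a minimal sketch of that arithmetic, not the rewriter itself. It assumes the original text segment covers 0x124584 bytes starting at offset 0 and that the hole is 0x090000 to 0x112000; the `Load` class and both helpers are hypothetical names.

```python
from dataclasses import dataclass

PAGE = 0x1000  # page size implied by the 0x1000 alignment in the example


@dataclass
class Load:
    offset: int
    vaddr: int
    filesz: int
    memsz: int


def pages(seg):
    """Number of pages the segment touches in memory."""
    first = seg.vaddr // PAGE
    last = (seg.vaddr + seg.memsz + PAGE - 1) // PAGE
    return last - first


def split_around_hole(seg, hole_start, hole_end):
    """Split one LOAD entry in two, leaving file range [hole_start, hole_end) unmapped."""
    assert seg.offset <= hole_start < hole_end <= seg.offset + seg.filesz
    assert hole_start % PAGE == 0 and hole_end % PAGE == 0
    delta = seg.vaddr - seg.offset  # constant file-to-memory displacement
    head_size = hole_start - seg.offset
    tail_size = seg.offset + seg.filesz - hole_end
    head = Load(seg.offset, seg.vaddr, head_size, head_size)
    tail = Load(hole_end, hole_end + delta, tail_size, tail_size)
    return head, tail


text = Load(offset=0x000000, vaddr=0x00000000, filesz=0x124584, memsz=0x124584)
head, tail = split_around_hole(text, 0x090000, 0x112000)
print(hex(head.filesz), hex(tail.offset), hex(tail.filesz))
print(pages(text), pages(head) + pages(tail))
```

The sketch treats FileSiz and MemSiz as equal, which holds for a text segment but not for a data segment that carries .bss.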
Rewriting




Rewriter based on DynInst
Profile data is used to create lists of
Used and Unused functions
Access / Modify symbols
Defragment functions to maximize
space savings
– Requires moving functions inside shared
libraries
Function Defragmentation
[Figure: layout of functions inside a shared library; legend: Used, Unused]
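
A rough way to see why defragmentation helps: if used functions are scattered, almost every page holds at least one of them, whereas packing them together leaves whole pages unused. The sketch below assumes functions can be placed back to back with no alignment padding, which is a simplification; the function list and helper names are hypothetical.

```python
PAGE = 0x1000


def pages_touched(functions):
    """Pages holding at least one used function, in the original order."""
    touched, addr = set(), 0
    for _name, size, used in functions:
        if used and size:
            touched.update(range(addr // PAGE, (addr + size - 1) // PAGE + 1))
        addr += size
    return len(touched)


def defragment(functions):
    """Place used functions first; return (new layout, pages needed for used code)."""
    ordered = [f for f in functions if f[2]] + [f for f in functions if not f[2]]
    layout, addr, used_end = {}, 0, 0
    for name, size, used in ordered:
        layout[name] = addr
        addr += size
        if used:
            used_end = addr
    return layout, (used_end + PAGE - 1) // PAGE


# (name, size in bytes, used?) for a hypothetical library
funcs = [("a", 0x800, True), ("b", 0x1800, False), ("c", 0x400, True),
         ("d", 0x2000, False), ("e", 0x600, True)]
layout, needed = defragment(funcs)
print(pages_touched(funcs), needed)
```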
Challenges: Relative Calls



Common way of calling functions in PIC.
If either callee or caller is moved, their
relative positioning changes.
Offsets in such relative call instructions
need to be updated
[Figure: a relative call to foo uses offset d before the move and offset d' after it]
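
The offset update can be sketched for the x86 `call rel32` encoding (opcode 0xE8 followed by a signed 32-bit displacement measured from the next instruction). This is an illustration of the arithmetic under that assumption, not the tool's code; `retarget_call` and the shift values are hypothetical.

```python
import struct


def retarget_call(code, call_off, caller_shift, callee_shift):
    """Patch the rel32 displacement of the x86 call at code[call_off] in place.

    caller_shift / callee_shift: how far (in bytes) the call site and the
    target moved. If both move by the same amount, the displacement is unchanged.
    """
    assert code[call_off] == 0xE8, "not a call rel32 instruction"
    (disp,) = struct.unpack_from("<i", code, call_off + 1)
    struct.pack_into("<i", code, call_off + 1, disp + callee_shift - caller_shift)


# A call whose target lies 0x100 bytes past the next instruction.
code = bytearray(b"\xe8" + struct.pack("<i", 0x100))
# The caller moves up by 0x40 bytes, the callee moves down by 0x2000 bytes.
retarget_call(code, 0, caller_shift=0x40, callee_shift=-0x2000)
new_disp = struct.unpack_from("<i", code, 1)[0]
print(new_disp)
```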
Challenges: Symbols

Runtime linker uses symbols to resolve
cross-library calls.
– Uses procedure linkage tables (plt)

If a function is moved, its associated
symbol has to be updated.
[Figure: call foo@plt resolves through the symbol foo; when foo is moved, the symbol's address changes from 0xdeadbeef to 0xbeefdead]
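
Updating a symbol comes down to rewriting its `st_value` field in the symbol table. The sketch below assumes a 32-bit little-endian ELF file, where each `Elf32_Sym` entry is 16 bytes, and matches symbols by their old address; the sample entry and the `update_symbols` helper are hypothetical.

```python
import struct

# Elf32_Sym: st_name, st_value, st_size, st_info, st_other, st_shndx
SYM = struct.Struct("<IIIBBH")


def update_symbols(symtab, moves):
    """Rewrite st_value in place for every symbol whose old address is in `moves`."""
    for off in range(0, len(symtab), SYM.size):
        name, value, size, info, other, shndx = SYM.unpack_from(symtab, off)
        if value in moves:
            SYM.pack_into(symtab, off, name, moves[value], size, info, other, shndx)


# One made-up symbol entry using the placeholder addresses from the slide.
symtab = bytearray(SYM.pack(1, 0xDEADBEEF, 0x40, 0x12, 0, 11))
update_symbols(symtab, {0xDEADBEEF: 0xBEEFDEAD})
print(hex(SYM.unpack_from(symtab, 0)[1]))
```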
Challenges: Jump Tables


Used to represent n-way branches at machine level
Targets are read from the jump table
– Entries are offsets of targets from the GOT address
A jump table becomes invalid if the function it references is moved
DynInst reads jump tables to generate CFGs
We update the entries so that they point to the new locations of the targets
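
Because each entry is an offset from the GOT address, updating an entry means recovering the old target, mapping it to its new location, and subtracting the GOT address again. A minimal sketch, assuming the GOT itself does not move; the addresses and helper names are hypothetical.

```python
def relocate(addr, moved):
    """Map an old address to its new one; `moved` holds (old_start, size, new_start)."""
    for old_start, size, new_start in moved:
        if old_start <= addr < old_start + size:
            return new_start + (addr - old_start)
    return addr  # address lies in code that was not moved


def update_jump_table(entries, got, moved):
    """Recompute GOT-relative entries after their targets have been moved."""
    return [relocate(got + entry, moved) - got for entry in entries]


got = 0x126000
# Two branch targets inside a function that occupied 0x95000..0x95100.
entries = [0x95010 - got, 0x95040 - got]
moved = [(0x95000, 0x100, 0x12000)]
new_entries = update_jump_table(entries, got, moved)
print([hex(got + e) for e in new_entries])
```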
Unexpectedly Called Function

Execution is not always predictable
– Unexpected function calls


Rewrite original executable with a Signal
Handler
Load the function upon an unexpected
call
– Signal Handler picks up page faults (SIGSEGV)
– Loads requested page on-demand
– Execution resumes

User-level: No OS modifications
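
The handler itself would be native code (on Linux, typically installed with `sigaction` and `SA_SIGINFO` so that the faulting address is available). The sketch below only illustrates the address arithmetic such a handler needs: which page to map and where in the library file its contents live. The load base, segment values and the `page_to_load` helper are hypothetical.

```python
PAGE = 0x1000


def page_to_load(fault_addr, lib_base, seg_offset, seg_vaddr):
    """Return (page address, file offset) for the page containing fault_addr.

    lib_base:   address at which the library was loaded
    seg_offset: file offset of the segment that originally held the code
    seg_vaddr:  virtual address of that segment inside the library
    """
    page_addr = fault_addr & ~(PAGE - 1)  # round down to a page boundary
    file_offset = seg_offset + (page_addr - lib_base - seg_vaddr)
    return page_addr, file_offset


# A call lands in a hole of a library loaded at 0x40000000.
page_addr, file_offset = page_to_load(0x400A1234, 0x40000000, 0x0, 0x0)
print(hex(page_addr), hex(file_offset))
```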
Experiments

Tested on
– PETSc ex5 in snes package
– PETSc ex2 in ksp package
– GS2



Compiled with debug flag and no
optimization
Used Open MPI
Tested on 64-node cluster at UMD
– Dual-core x86 processors
– Unmodified Linux kernel

Space savings of about 82% on average
PETSc – snes (ex5)
Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
petsc          260                     68                      73.85
petscdm        161                     19                      88.2
petscksp       335                     39                      88.36
petscmat       772                     40                      94.82
petscvec       204                     52                      74.51
petscsnes      20                      20                      0
mpi_cxx        10                      5                       50
mpi            142                     37                      73.94
open-pal       62                      34                      45.16
open-rte       55                      34                      38.18
m              28                      3                       89.29
X11            146                     7                       95.21
lapack         866                     2                       99.77
blas           80                      3                       96.25
stdc++         133                     12                      90.98
gcc_s          12                      2                       83.33
Xau            2                       2                       0
Xdcm           3                       3                       0
gfortran       123                     4                       96.75
dl             2                       2                       0
nsl            14                      2                       85.71
util           2                       2                       0
OVERALL        2021                    348                     82.78
PETSc – snes (ex5)
[Figure: bar chart of text pages, Original vs. Modified; vertical axis from 0 to 1000, with one bar annotated (2021)]
PETSc – ksp (ex2)
Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
petsc          260                     72                      72.31
petscdm        161                     3                       98.14
petscksp       335                     49                      85.37
petscmat       772                     49                      93.65
petscvec       204                     54                      73.53
mpi_cxx        10                      5                       50
mpi            142                     47                      66.9
open-pal       62                      37                      40.32
open-rte       55                      36                      34.55
OVERALL        2001                    352                     82.41
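
The Reduction % column is 100 × (1 − modified / original), and the OVERALL row applies the same formula to the column sums. The short check below uses the page counts from the ex2 table (the 142-page row is taken to be mpi); the `reduction` helper is a hypothetical name.

```python
# (original text pages, modified text pages) per library, PETSc ksp ex2
rows = {
    "petsc": (260, 72), "petscdm": (161, 3), "petscksp": (335, 49),
    "petscmat": (772, 49), "petscvec": (204, 54), "mpi_cxx": (10, 5),
    "mpi": (142, 47), "open-pal": (62, 37), "open-rte": (55, 36),
}


def reduction(original, modified):
    """Percentage of text pages that no longer need to be loaded."""
    return round(100 * (1 - modified / original), 2)


total_orig = sum(o for o, _ in rows.values())
total_mod = sum(m for _, m in rows.values())
print(total_orig, total_mod, reduction(total_orig, total_mod))
```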
GS2
Library Name   Text Pages (Original)   Text Pages (Modified)   Reduction %
MdsLib         21                      0                       100
MdsShr         21                      0                       100
TdiShr         220                     3                       98.64
TreeShr        38                      0                       100
fftw           70                      25                      64.29
rfftw          58                      8                       86.21
mpi_f77        13                      2                       84.62
mpi            142                     40                      71.83
open-pal       62                      36                      41.94
open-rte       55                      36                      34.55
OVERALL        700                     150                     78.57
Running Times

GS2 takes 5 seconds less on average
– (36m 38s vs. 36m 33s)

Overhead on PETSc examples
– ex2 runs for 2.7 secs, ex5 runs for 1.05 secs.
Running Times

Results suggest no overhead for
reasonably long-running programs
– Initial cost for signal handler registration
– Better instruction cache and TLB performance
Summary


Our tool reduces memory footprint of
shared libraries
Rewrite shared libraries with holes
– Defragment functions to maximize space
savings



On-demand page loading if a not-yet-loaded function is called
About 82% memory space savings for
shared libraries
Might improve instruction cache and
TLB performance