Profiling tools

By Vitaly Kroivets
for Software Design Seminar
Contents
• Introduction
    • Software optimization process, optimization traps and pitfalls
    • Benchmark
• Performance tools overview
    • Optimizing compilers
    • System performance monitors
• Profiling tools
    • GNU gprof
    • Intel VTune
    • Valgrind
• What does it mean to use the system efficiently?

The Problem
• PC speed has increased about 500 times since 1981, but today's software is more complex and still hungry for more resources
• How can we run faster on the same hardware and OS architecture?
    • Highly optimized applications run tens of times faster than poorly written ones
    • Using efficient algorithms and well-designed implementations leads to high-performance applications

The Software Optimization Process
1. Create benchmark
2. Find hotspots (hotspots are areas in your code that take a long time to execute)
3. Investigate causes
4. Modify application
5. Retest using benchmark (the cycle then repeats from step 2)

Extreme Optimization Pitfalls
• A large application's performance cannot be improved before it runs
• Build the application, then see what machine it runs on
• "Runs great on my computer…"
• Debug versus release builds
• Performance requires assembly language programming
• Code features first, then optimize if there is time left over

Key Point:
Software optimization doesn't begin where coding ends. It is an ongoing process that starts at the design stage and continues all the way through development.

The Benchmark
• The benchmark is a program that is used to:
    • Objectively evaluate the performance of an application
    • Provide repeatable application behavior for use with performance analysis tools
• Industry-standard benchmarks:
    • TPC-C
    • 3D-Winbench
    • SPEC (http://www.specbench.com/), with benchmarks for: Enterprise Services, Graphics/Applications, HPC/OMP, Java Client/Server, Mail Servers, Network File System, Web Servers

Attributes of a good benchmark
• Repeatable (consistent measurements)
    • Remember system tasks and caching issues
    • "Incoming fax" problem: use the minimum performance number (the fastest of several runs)
• Representative
    • Executes a typical code path, mimicking how the customer uses the application
    • Poor benchmarks: using QA tests

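The "use the minimum" rule for repeatable measurements can be sketched as follows. This is a minimal illustration in Python, not part of the original slides: `benchmark` and `sample_workload` are hypothetical names, and the workload is a stand-in for an application's typical code path.

```python
import time

def benchmark(workload, runs=5):
    """Run `workload` several times; return (best_time, all_timings) in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    # Interruptions (system tasks, an "incoming fax") can only make a run
    # slower, so the minimum is the least disturbed measurement.
    return min(timings), timings

def sample_workload():
    # Hypothetical stand-in for the application's typical code path.
    return sum(i * i for i in range(100_000))

if __name__ == "__main__":
    best, timings = benchmark(sample_workload)
    print(f"best of {len(timings)} runs: {best:.6f} s")
```

Reporting the minimum rather than the mean keeps one disturbed run from skewing the result.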
Benchmark attributes (cont.)
• Easy to run
• Verifiable
    • Need QA for the benchmark!
• Measure elapsed time vs. other number
• Use the benchmark to test functionality
    • Algorithmic tricks to gain performance may break the application…

How to find performance bottlenecks
• Determine how your system resources, such as memory and processor, are being utilized to identify system-level bottlenecks
• Measure the execution time for each module and function in your application
• Determine how the various modules running on your system affect the performance of each other
• Identify the most time-consuming function calls and call sequences within your application
• Determine how your application is executing at the processor level to identify microarchitecture-level performance problems

Performance Tools Overview
• Timing mechanisms
    • Stopwatch: UNIX time tool
• Optimizing compiler (easy way)
• System load monitors
    • vmstat, iostat, perfmon.exe, VTune Counter Monitor
• Software profiler
    • gprof, VTune, Visual C++ Profiler, IBM Quantify
• Memory debugger/profiler
    • Valgrind, IBM Purify, Parasoft Insure++

Using Optimizing Compilers
• Always use compiler optimization settings to build an application for use with performance tools
• Understanding and using all the features of an optimizing compiler is required for maximum performance with the least effort

Optimizing Compiler: choosing a combination of optimization flags
Optimizing Compiler’s effect
Optimizing Compilers: Conclusions
• Some processor-specific options still do not appear to be a major factor in producing fast code
• More optimizations do not guarantee faster code
• Different algorithms are most effective with different optimizations
• Idea: use statistics gathered by a profiler as input for the compiler/linker

Windows Performance Monitor
• Sampling "profiler"
• Uses the OS timer interrupt to wake up and record the value of software counters (disk reads, free memory)
• Maximum resolution: 1 sec
• Cannot identify the piece of code that caused an event to occur
• Good for finding system issues
• Unix tools: vmstat, iostat, xos, top, oprofile, etc.

Performance Monitor Counters
Profilers
• A profiler may show the time elapsed in each function and its descendants
    • Number of calls, call graph (some profilers)
• Profilers use either instrumentation or sampling to identify performance issues

Sampling vs. Instrumentation

                           | Sampling                                          | Instrumentation
Overhead                   | Typically about 1%                                | High, may be 500%!
System-wide profiling      | Yes, profiles all apps, drivers, OS functions     | Just the application and instrumented DLLs
Detect unexpected events   | Yes, can detect other programs using OS resources | No
Setup                      | None                                              | Automatic insertion of data collection stubs required
Data collected             | Counters, processor and OS state                  | Call graph, call times, critical path
Data granularity           | Assembly-level instructions, with source line     | Functions, sometimes statements
Detects algorithmic issues | No, limited to processes, threads                 | Yes, can see that an algorithm or call path is expensive

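To make "data collection stubs" concrete, here is a minimal Python sketch of instrumentation. It is an illustration only and does not reproduce how gprof or VTune insert their stubs; `instrumented`, `call_stats` and the two sample functions are hypothetical names.

```python
import functools
import time

call_stats = {}  # function name -> [call count, total seconds]

def instrumented(func):
    """Wrap `func` in a stub that records each call and its elapsed time."""
    stats = call_stats.setdefault(func.__name__, [0, 0.0])

    @functools.wraps(func)
    def stub(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats[0] += 1                            # exact call count
            stats[1] += time.perf_counter() - start  # includes callees

    return stub

@instrumented
def square(x):
    return x * x

@instrumented
def sum_of_squares(n):
    return sum(square(i) for i in range(n))

if __name__ == "__main__":
    sum_of_squares(1000)
    for name, (calls, seconds) in call_stats.items():
        print(f"{name}: {calls} calls, {seconds:.6f} s")
```

Every call pays for two timer reads and a counter update, which is why instrumentation overhead grows quickly for small, frequently called functions.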
Profiling Tools
• Gprof: old, buggy and inaccurate
• Intel VTune: $700, unstable
• Valgrind: is not really a profiler…

GNU gprof
Instrumenting profiler for every UNIX-like system

Using gprof GNU profiler
• Compile and link your program with profiling enabled
    cc -g -pg -c myprog.c utils.c
    cc -pg -o myprog myprog.o utils.o
• Execute your program to generate a profile data file
    • The program will run normally (but slower) and will write the profile data into a file called gmon.out just before exiting
    • The program should exit normally, using the exit() function or by returning from main()
• Run gprof to analyze the profile data
    gprof myprog gmon.out

Example Program
Understanding Flat Profile
• The flat profile shows the total amount of time your program spent executing each function
• If a function was not compiled for profiling, and didn't run long enough to show up on the program counter histogram, it will be indistinguishable from a function that was never called

Flat profile: % time
Percentage of the total execution time your program spent in this function. These should all add up to 100%.

Flat profile: Cumulative seconds
The cumulative total number of seconds spent in this function, plus the time spent in all the functions above this one in the listing.

Flat profile: Self seconds
Number of seconds accounted for by this function alone.

Flat profile: Calls
Number of times the function was invoked.

Flat profile: Self seconds per call
Average number of seconds per call spent in this function alone.

Flat profile: Total seconds per call
Average number of seconds per call spent in this function and its descendants.

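How the flat-profile columns relate to each other can be sketched with a few lines of Python. The measurements below are hypothetical, not taken from a real gprof run, and `flat_profile` is an illustrative helper rather than anything gprof provides.

```python
def flat_profile(functions):
    """Build flat-profile rows from raw per-function measurements.

    `functions` maps a name to (self_seconds, calls, total_seconds), where
    total_seconds also includes time spent in descendants.
    """
    total = sum(self_s for self_s, _, _ in functions.values())
    rows = []
    cumulative = 0.0
    # Rows are listed by decreasing self time, as in gprof's flat profile.
    ordered = sorted(functions.items(), key=lambda item: item[1][0], reverse=True)
    for name, (self_s, calls, total_s) in ordered:
        cumulative += self_s
        rows.append({
            "name": name,
            "percent_time": 100.0 * self_s / total,
            "cumulative_seconds": cumulative,
            "self_seconds": self_s,
            "calls": calls,
            "self_s_per_call": self_s / calls,
            "total_s_per_call": total_s / calls,
        })
    return rows

# Hypothetical measurements: name -> (self seconds, calls, total seconds).
measurements = {
    "main": (0.5, 1, 4.0),
    "g":    (2.0, 4, 3.5),
    "doit": (1.5, 8, 1.5),
}

if __name__ == "__main__":
    for row in flat_profile(measurements):
        print(row)
```

With these numbers `g` comes first with 50% of the time, and the cumulative column ends at 4.0 seconds, the total run time.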
Call Graph: call tree of the program
• Called by: main()
• Current function: g()
• Descendants: doit()

Call Graph: understanding each line
For the current function, g():
• Unique index of this function
• Percentage of the "total" time spent in this function and its children
• Total amount of time spent in this function
• Total time propagated into this function by its children
• Number of times the function was called

Call Graph: parents' numbers
For each parent of the current function, g():
• Time that was propagated directly from the function into this parent
• Time that was propagated from the function's children into this parent
• Number of times this parent called the function "/" total number of times the function was called

Call Graph: "children" numbers
For each child of the current function, g():
• Amount of time that was propagated directly from the child into the function
• Amount of time that was propagated from the child's children to the function
• Number of times this function called the child "/" total number of times this child was called

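The "propagated" times in the call graph are estimates: gprof assumes that the average time per call does not depend on who the caller is, so a function's time is shared among its parents in proportion to their call counts. A minimal Python sketch of that apportioning follows; the function name and all numbers are hypothetical.

```python
def propagate_to_parents(self_seconds, children_seconds, calls_by_parent):
    """Split a function's time among its callers in proportion to call counts.

    Returns {parent: (self_share, children_share, "calls/total")}.
    """
    total_calls = sum(calls_by_parent.values())
    shares = {}
    for parent, calls in calls_by_parent.items():
        fraction = calls / total_calls
        shares[parent] = (
            self_seconds * fraction,      # propagated directly from the function
            children_seconds * fraction,  # propagated from the function's children
            f"{calls}/{total_calls}",
        )
    return shares

# Hypothetical numbers: g() ran 2.0 s itself and 6.0 s in its children,
# and was called once from main() and three times from helper().
shares = propagate_to_parents(2.0, 6.0, {"main": 1, "helper": 3})

if __name__ == "__main__":
    for parent, share in shares.items():
        print(parent, share)
```

Here `main` is charged 0.5 s and 1.5 s with a "1/4" call figure, and `helper` gets the remaining three quarters.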
How gprof works
• Instruments the program to count calls
• Watches the program running, sampling the PC every 0.01 sec
    • Statistical inaccuracy: a fast function may get 0 or 1 samples
    • The run should be long compared with the sampling period
    • Several gmon.out files can be combined into a single report
• The output from gprof gives no indication of parts of your program that are limited by I/O or swapping bandwidth. This is because samples of the program counter are taken at fixed intervals of run time
• Number-of-calls figures are derived by counting, not sampling. They are completely accurate and will not vary from run to run if your program is deterministic
• Profiling with inlining and other optimizations needs care

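The statistical inaccuracy of program-counter sampling can be shown with a small deterministic simulation. This is a sketch under stated assumptions: the timelines are hypothetical, times are in microseconds, and a sample is charged to whichever function is running when the 0.01 s timer fires.

```python
from collections import Counter

def sample_pc(timeline, period_us=10_000):
    """Count program-counter samples per function.

    `timeline` lists (function_name, duration_us) in execution order; a timer
    tick fires every `period_us` microseconds of run time.
    """
    samples = Counter()
    now = 0
    next_tick = period_us
    for name, duration_us in timeline:
        end = now + duration_us
        # Every tick that falls while this function runs is charged to it.
        while next_tick < end:
            samples[name] += 1
            next_tick += period_us
        now = end
    return samples

# The same 4 ms function "fast" in two hypothetical runs of 100 ms each.
run_a = sample_pc([("setup", 3_000), ("fast", 4_000), ("work", 93_000)])
run_b = sample_pc([("setup", 8_000), ("fast", 4_000), ("work", 88_000)])

if __name__ == "__main__":
    print(run_a["fast"], run_b["fast"])  # 0 samples in run A, 1 in run B
```

`fast` takes 4% of the run time in both cases, yet it is reported as 0 of 9 samples in one run and 1 of 9 in the other, which is why the run should be long compared with the sampling period.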
VTune performance analyzer
To squeeze every bit of power out of Intel architecture!

VTune Modes/Features
• Time- and Event-Based, System-Wide Sampling provides developers with the most accurate representation of their software's actual performance with negligible overhead
• Call Graph Profiling provides developers with a pictorial view of program flow to quickly identify critical functions and call sequences
• Counter Monitor allows developers to readily track system activity during runtime, which helps them identify system-level performance issues

Sampling mode
• Monitors all active software on your system
    • Including your application, the OS, JIT-compiled Java* class files, Microsoft* .NET files, 16-bit applications, 32-bit applications, device drivers
• Application performance is not impacted during data collection

Sampling Mode Benefits
• Low-overhead, system-wide profiling helps you identify which modules and functions are consuming the most time, giving you a detailed look at your operating system and application
• Benefits of sampling:
    • Profiling to find hotspots. Find the module, functions, lines of source code and assembly instructions that are consuming the most time
    • Low overhead. Overhead incurred by sampling is typically about one percent
    • No need to instrument code. You do not need to make any changes to code to profile with sampling

How does sampling work?
• Sampling interrupts the processor after a certain number of events and records the execution information in a buffer area. When the buffer is full, the information is copied to a file. After saving the information, the program resumes operation. In this way, the VTune™ analyzer maintains very low overhead (about one percent) while sampling
    • Time-based sampling: collects samples of active instruction addresses at regular time-based intervals (1 ms by default)
    • Event-based sampling: collects samples of active instruction addresses after a specified number of processor events
• After the program finishes, the samples are mapped to modules and stored in a database within the analyzer program

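The interrupt, buffer and file steps described above can be mimicked with a short simulation. This is a rough Python sketch, not the analyzer's implementation: `event_based_sampling`, its parameters and the event stream are hypothetical, and a plain list stands in for the sample file.

```python
def event_based_sampling(event_addresses, sample_after=1000, buffer_size=4):
    """Record the active address once every `sample_after` events.

    Samples go into a small buffer; a full buffer is copied to `saved`,
    which stands in for the file. Returns (saved_samples, flush_count).
    """
    buffer, saved = [], []
    flushes = 0
    for count, address in enumerate(event_addresses, start=1):
        if count % sample_after == 0:        # counter overflow: "interrupt"
            buffer.append(address)
            if len(buffer) == buffer_size:   # buffer full: copy to the file
                saved.extend(buffer)
                buffer.clear()
                flushes += 1
    saved.extend(buffer)                     # write the remainder at the end
    return saved, flushes

# Hypothetical event stream: 2,000 events in init, then 8,000 in hot_loop.
stream = ["init"] * 2000 + ["hot_loop"] * 8000

if __name__ == "__main__":
    saved, flushes = event_based_sampling(stream)
    print(saved.count("hot_loop"), "of", len(saved), "samples in hot_loop")
```

Only one event in a thousand triggers any bookkeeping, which is the reason sampling overhead stays low; mapping the saved addresses to modules happens after the run.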
Starting the Sampling Wizard
• The hardware prevents sampling many counters simultaneously
• Unsupported CPU? Ha-ha-ha…

EBS (event-based sampling): choosing events
Events counted by VTune
• Basic events: clock cycles, retired instructions
• Instruction execution: instruction decode, issue and execution, data and control speculation, and memory operations
• Cycle accounting events: stall cycle breakdowns
• Branch events: branch prediction
• Memory hierarchy: instruction prefetch, instruction and data caches
• System events: operating system monitors, instruction and data TLBs

Sampling …
Viewing Sampling Results
• Process view
    • All the processes that ran on the system during data collection
• Thread view
    • The threads that ran within the processes you select in Process view
• Module view
    • The modules that ran within the selected processes and threads
• Hotspot view
    • The functions within the modules you select in Module view

Different events collected: modules view
• System-wide look at software running on the system
• Our program
• CPI (cycles per instruction): a good average indication

Hotspot Graph
• Each bar represents one of the functions of our program
• Click on a hotspot bar: VTune displays the source code view

Source View
• Test_if function

Annotated Source View (% of module)
• See how much time is spent on each line
• Check this "for" loop! 10% of CPU time is spent in a few statements

VTune Tuning Assistant
• In a few clicks we reached the performance problem!
• Now, how to solve it?
• Tuning Assistant highlights performance problems
    • Analyzes the data gathered from our application
    • Generates tuning recommendations for each "hotspot"
    • Gives the user an idea of what might be done to fix the problem
• Provides the approximate time lost to each performance problem
• Database contains performance metrics based on Intel's experience of tuning hundreds of applications

Tuning Assistance Report
Hotspot Assistant Report : Penalties
Hotspot Assistant Report
Call Graph Mode
• Provides a pictorial view of program flow to quickly identify critical functions and call sequences
• Call graph profiling reveals:
    • Structure of your program on a function level
    • Number of times a function is called from a particular location
    • The time spent in each function
    • Functions on a critical path

Call Graph Screenshot
• The function summary pane
• Critical path displayed as red lines: the call sequence in an application that took the most time to execute
• Switch to Call List view

Call Graph (Cont.)
• Wait time: how much time was spent waiting for an event to occur
• Additional info is available by hovering the mouse over the functions

Jump to Source view
Call Graph: Call List View
• Caller Functions are the functions that called the Focus Function
• Callee Functions are the functions that are called by the Focus Function

Counter Monitor
• Use the Counter Monitor feature of the VTune™ analyzer to collect and display performance counter data. Counter Monitor selectively polls performance counters, which are grouped categorically into performance objects
• With the VTune analyzer, you can:
    • Monitor selected counters in performance objects
    • Correlate performance counter data with data collected by other features in the VTune analyzer, such as sampling
    • Trigger the collection of counter data on events other than a periodic timer

Counter Monitor
Getting Help
• Context-sensitive help
• Online help repository

VTune Summary
• Pros: lets you get the best possible performance out of Intel architecture
• Cons: extreme tuning requires a deep understanding of processor and OS internals

Valgrind
Multi-purpose Linux x86 profiling tool

Valgrind Toolkit
• Memcheck is a memory debugger
    • Detects memory-management problems
• Cachegrind is a cache profiler
    • Performs detailed simulation of the I1, D1 and L2 caches in your CPU
• Massif is a heap profiler
    • Performs detailed heap profiling by taking regular snapshots of a program's heap
• Helgrind is a thread debugger
    • Finds data races in multithreaded programs

Memcheck Features
• When a program is run under Memcheck's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted
• Memcheck can detect:
    • Use of uninitialised memory
    • Reading/writing memory after it has been free'd
    • Reading/writing off the end of malloc'd blocks
    • Reading/writing inappropriate areas on the stack
    • Memory leaks, where pointers to malloc'd blocks are lost forever
    • Passing of uninitialised and/or unaddressable memory to system calls
    • Mismatched use of malloc/new/new [] vs free/delete/delete []
    • Overlapping src and dst pointers in memcpy() and related functions
    • Some misuses of the POSIX pthreads API

Memcheck Example
Errors annotated in the example program:
• Access of unallocated memory
• Using a non-initialized value
• Memory leak
• Using "free" on memory allocated by "new"

Memcheck Example (Cont.)
• Compile the program with the -g flag:
    g++ -g a.cc -o a.out
• Execute valgrind (its report goes to stderr):
    valgrind --tool=memcheck --leak-check=yes ./a.out 2> log
    • --leak-check=yes: debug leaks
    • ./a.out: executable name
• View log

Memcheck report
Memcheck report (cont.)
• Leaks detected
• Stack

Cachegrind
• Detailed cache profiling can be very useful for improving the performance of the program
    • On a modern x86 machine, an L1 miss will cost around 10 cycles, and an L2 miss can cost as much as 200 cycles
    • Cachegrind performs detailed simulation of the I1, D1 and L2 caches in your CPU
    • Can accurately pinpoint the sources of cache misses in your code
    • Identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries
    • Cachegrind runs programs about 20-100x slower than normal

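What "simulating a cache" means can be illustrated with a toy model. The Python sketch below is far simpler than Cachegrind: it assumes a single direct-mapped data cache of 8 KB with 64-byte lines (both sizes are assumptions for the example), and `count_misses` and the array layout are hypothetical.

```python
def count_misses(addresses, cache_bytes=8192, line_bytes=64):
    """Count misses in a simulated direct-mapped cache.

    Real caches are usually set-associative; direct mapping is a
    simplification made for this sketch.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # memory block held by each cache line
    misses = 0
    for address in addresses:
        block = address // line_bytes  # memory block containing the address
        index = block % num_lines      # the cache line this block maps to
        if tags[index] != block:
            misses += 1
            tags[index] = block        # load the block, evicting the old one
    return misses

# Hypothetical 256 x 256 array of 4-byte integers stored row by row.
ROWS, COLS, ELEM = 256, 256, 4
row_major = [(r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS)]
col_major = [(r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS)]

if __name__ == "__main__":
    print("row by row:      ", count_misses(row_major), "misses")
    print("column by column:", count_misses(col_major), "misses")
```

Walking the array row by row misses once per 64-byte line (4,096 misses), while walking it column by column misses on every one of the 65,536 accesses in this model, the kind of difference a cache profiler attributes to individual source lines.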
How to run
• Run valgrind --tool=cachegrind in front of the normal command line invocation
    • Example: valgrind --tool=cachegrind ls -l
• When the program finishes, Cachegrind will print summary cache statistics. It also collects line-by-line information in a file cachegrind.out.pid
• Execute cg_annotate to get an annotated source file:
    cg_annotate --7618 a.cc > a.cc.annotated
    • 7618: PID
    • a.cc: source files

Cachegrind Summary output
• Instruction cache performance
    • I-cache reads (instructions executed)
    • I1 cache read misses
    • L2-cache instruction read misses
• Data cache READ performance
    • D-cache reads (memory reads)
    • D1 cache read misses
    • L2-cache data read misses
• Data cache WRITE performance
    • D-cache writes (memory writes)
    • D1 cache write misses
    • L2-cache data write misses

Cachegrind Accuracy
• Valgrind's cache profiling has a number of shortcomings:
    • It doesn't account for kernel activity: the effect of system calls on the cache contents is ignored
    • It doesn't account for other process activity (although this is probably desirable when considering a single program)
    • It doesn't account for virtual-to-physical address mappings; hence the entire simulation is not a true representation of what's happening in the cache

Massif tool
• Massif is a heap profiler: it measures how much heap memory programs use. It can give information about:
    • Heap blocks
    • Heap administration blocks
    • Stack sizes
• Helps to reduce the amount of memory the program uses
    • A smaller program interacts better with caches and avoids paging
• Detects leaks that aren't detected by traditional leak-checkers, such as Memcheck
    • That's because the memory isn't ever actually lost (a pointer to it remains), but it's not in use anymore

Executing Massif
• Run valgrind --tool=massif prog
• Produces the following:
    • Summary
    • Graph picture
    • Report
• The summary will look like this:
    Total spacetime: 2,258,106 ms.B
    Heap: 24.0%
    Heap admin: 2.2%
    Stack(s): 73.7%
    • Total spacetime: space (in bytes) multiplied by time (in milliseconds)
    • Heap: number of words allocated on the heap, via malloc(), new and new[]

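"Space multiplied by time" can be read as the area under the memory-use curve. The Python sketch below shows one way to compute it; it assumes the size seen at each census holds until the next one, which may differ from Massif's exact accounting, and the census data is hypothetical.

```python
def spacetime(censuses, end_ms):
    """Integrate memory use over time, in ms.B (milliseconds times bytes).

    `censuses` is a list of (time_ms, bytes_in_use) sorted by time; each
    size is assumed to hold until the next census (or until `end_ms`).
    """
    total = 0
    following = censuses[1:] + [(end_ms, 0)]
    for (t, size), (t_next, _) in zip(censuses, following):
        total += size * (t_next - t)
    return total

# Hypothetical censuses for a 100 ms run: (time in ms, bytes in use).
heap = [(0, 0), (10, 4000), (30, 12000), (60, 2000)]
stack = [(0, 1000), (60, 1000)]

if __name__ == "__main__":
    heap_st = spacetime(heap, 100)
    stack_st = spacetime(stack, 100)
    total = heap_st + stack_st
    print(f"Total spacetime: {total:,} ms.B")
    print(f"Heap: {100 * heap_st / total:.1f}%")
    print(f"Stack(s): {100 * stack_st / total:.1f}%")
```

A large allocation that lives briefly and a small one that lives for the whole run can contribute the same spacetime, which is why the percentages differ from a simple peak-size comparison.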
Spacetime Graphs
Spacetime Graph (Cont.)
• Each band represents a single line of source code
• It's the height of a band that's important
• Triangles on the x-axis show each point at which a memory census was taken
    • Not necessarily evenly spread; Massif only takes a census when memory is allocated or de-allocated
• The time on the x-axis is wall-clock time
    • Not ideal, because you can get different graphs for different executions of the same program, due to random OS delays

Text/HTML Report example
• Contains a lot of extra information about heap allocations that you don't see in the graph
• Shows places in the program where most memory was allocated

Valgrind: how it works
• Valgrind is compiled into a shared object, valgrind.so. The shell script valgrind sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library into any subsequently executed dynamically-linked ELF binary
• The dynamic linker allows each .so in the process image to have an initialization function which is run before main(). It also allows each .so to have a finalization function run after main() exits
• When valgrind.so's initialization function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so until the end of the run
• System calls are intercepted; signal handlers are monitored

Valgrind Summary
• Valgrind will save hours of debugging time
• Valgrind can help speed up your programs
• Valgrind runs on x86-Linux
• Valgrind works with programs written in any language
• Valgrind can be used with other tools (gdb)
• Valgrind is easy to use
    • Uses dynamic binary translation, so there is no need to modify, recompile or re-link applications. Just prefix the command line with valgrind and everything works
• Valgrind is not a toy
    • Used by large projects: 25 million lines of code
• Valgrind is actively maintained
• Valgrind is free

Other Tools
• Tools not included in this presentation:
    • IBM Purify
    • Parasoft Insure
    • KCachegrind
    • Oprofile
    • GCC's and GLIBC's debugging hooks

Writing Fast Programs
• Select the right algorithm
• Implement it efficiently
• Detect hotspots using a profiler and fix them
    • Understanding of the target system architecture, such as cache structure, is often required
    • Use platform-specific compiler extensions: memory pre-fetching, cache-control instructions, branch prediction, SIMD instructions
    • Write multithreaded applications ("Hyper-Threading Technology")

CPU Architecture (Pentium 4)
[Diagram blocks: branch prediction, instruction fetch, instruction decode, instruction pool, retirement, execution units, memory]

Instruction Execution
[Diagram: instruction pool, dispatch unit, and execution units (two integer, two floating point, memory load, memory save)]

Keeping CPU Busy
• Processors are limited by data dependencies and speed of instructions
    • Keep data dependencies low
    • A good blend of instructions keeps all execution units busy at the same time
    • Waiting for memory with nothing else to execute is the most common reason for slow applications
• Goals: ready instructions, a good mix of instructions and predictable branches
    • Remove branches if possible
    • Reduce randomness of branches; avoid function pointers and jump tables

Memory Overview (Pentium 4)
• L1 cache (data only): 8 kbytes
• Execution Trace Cache that stores up to 12K decoded micro-ops
• L2 Advanced Transfer Cache (data + instructions): 256 kbytes, 3 times slower than L1
• L3: 4 MB cache (optional)
• Main RAM (usually 64M … 4G): 10 times slower than L1

Fixing memory problems
• Use less memory to reduce compulsory cache misses
• Increase cache efficiency (place items used at the same time near each other)
• Read sooner with prefetch
• Write memory faster without using the cache
• Avoid conflicts
• Avoid capacity issues
• Add more work for the CPU (execute non-dependent instructions while waiting)

References
• SPEC website: http://www.specbench.org
• The Software Optimization Cookbook: High-Performance Recipes for the Intel® Architecture, by Richard Gerber
• GCC optimization flags: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
• Valgrind homepage: http://valgrind.kde.org
• An Evolutionary Analysis of GNU C Optimizations Using Natural Selection to Investigate Software Complexities, by Scott Robert Ladd
• Intel VTune Performance Analyzer webpage: http://www.intel.com/software/products/vtune/
• gprof man page: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html

Questions?