Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Maximizing Your SPARC T5 Oracle Solaris Application Performance
 Darryl Gove
Senior Principal Software Engineer
Program Agenda
 Hardware
 Correctness
 Performance
 Parallelism
Oracle Solaris Studio

Compiler Suite
C, C++ Compilers utilize advanced code generation technology to optimize apps for highest performance on SPARC & x86
Fortran Compiler optimizes compute-intensive app performance
Debugger ensures app stability with event handling & multi-thread support
Performance Library maximizes compute-intensive app performance using advanced numeric solver libraries

Analysis Suite
Performance Analyzer provides unparalleled insight into your app, allowing you to identify bottlenecks and improve performance by orders of magnitude
Code Analyzer (new) ensures app reliability by detecting app vulnerabilities, including memory leaks and memory access violations
Thread Analyzer simplifies complex parallel programming errors by detecting hard to pinpoint race and deadlock conditions

Integrated Development Environment increases developer efficiency
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
Oracle Solaris Studio 12.3 Highlights
Accelerate Performance

5x faster code on SPARC T5

1.5x faster code on Intel x86

20% faster compile time
Gain Extreme Observability

New Code Analyzer for more reliable applications; reports common coding & memory access errors faster than competitive alternatives

Enhanced Performance Analyzer with system-wide performance analysis
Improve Productivity

Remote access to Solaris Studio tools from local desktop (Oracle Solaris, Linux, Microsoft Windows, Mac)

Streamlined Oracle DB application development

Simplify Oracle Tuxedo development with IDE plug-in

IPS distribution on Solaris 11 for simplified management
Oracle Solaris Studio 12.3, 1/13 PSE
 Delivers compiler optimisations resulting in the fastest code on the
new Oracle T5, Oracle M5 and Fujitsu M10 systems

Up to 5x faster than GCC

Up to 10% faster than Oracle Solaris Studio 12.3
 IPS or SVR4 package update to Oracle Solaris Studio 12.3
 Available for customers with the Solaris Development Tools Support
contract
 More information: Article ID 1519949.1 on My Oracle Support
SPARC T5 Hardware
SPARC T5 - Overview
 Enhancement over T4

More threads

Faster clock speed

Larger third level cache
 T5 and T4:

Unrelated to T1 – T3 (only share the T-series name)

Enhanced multithread throughput

Enhanced single thread performance
SPARC T5 - Details
 1 to 8 chips per system
 16 cores per chip

Dual issue

Out-of-order
 8 threads per core
 3.6 GHz clock

115B (3.6 GHz * 16 * 2) instructions / sec / chip
SPARC T5 - Capacity
 Chip capacity: 115 B instructions / sec
 For fully active threads:

Single thread: 7.2 B instructions / sec

Each of eight threads: 0.9 B instructions / sec
 Threads rarely fully active:

I/O wait

Processor stall (fetch from memory = 300-400 cycles)
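The capacity figures follow from simple arithmetic on the clock rate, issue width, and core count quoted on the slides. A minimal C sketch of that arithmetic is shown here; the function names are hypothetical, and the per-thread figure assumes that active threads share a core's issue slots evenly, which is an upper bound rather than a measurement.

```c
#include <stdio.h>

/* Peak instructions per second for one core: clock rate times issue width. */
double core_peak_ips(double clock_hz, int issue_width)
{
    return clock_hz * issue_width;
}

/* Peak for the chip: every core issuing at its peak rate. */
double chip_peak_ips(double clock_hz, int issue_width, int cores)
{
    return core_peak_ips(clock_hz, issue_width) * cores;
}

/* Upper bound per thread when active_threads share one core evenly
   (an assumption made for illustration). */
double thread_peak_ips(double clock_hz, int issue_width, int active_threads)
{
    return core_peak_ips(clock_hz, issue_width) / active_threads;
}

/* Print the slide figures in billions of instructions per second. */
void print_t5_capacity(void)
{
    const double clock_hz = 3.6e9; /* 3.6 GHz */
    printf("chip:          %.1f B/s\n", chip_peak_ips(clock_hz, 2, 16) / 1e9);
    printf("single thread: %.1f B/s\n", thread_peak_ips(clock_hz, 2, 1) / 1e9);
    printf("each of eight: %.1f B/s\n", thread_peak_ips(clock_hz, 2, 8) / 1e9);
}
```

These are peak rates. Memory stalls of 300-400 cycles mean a real thread usually issues far fewer instructions, which is why running many threads per core helps keep the issue slots busy.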
Developing for T5
 Make it correct
 Remove obvious performance issues
 Make it scale (correctly)
Application Correctness
Debug information
 Always use -g
 No optimisation flags:

Full debug

Lower performance
 Optimised binaries:

Best effort debug

No/minimal performance impact
 Debug what you ship!
Automatic Error Detection
 Static/compile time error detection

Code Analyzer
 Dynamic/runtime memory access error detection

Discover
Code Analyzer
 Static analysis for common coding errors

Uninitialised variables, etc.
 Compile with:

-xanalyze=code
 View results with:

code-analyzer <a.out>
Code Analyzer – example output
Memory Error Detection - discover
 Common memory allocation and use errors:

Uninitialised memory

Access past bounds

Memory leaks
 Usage:

discover <a.out> (instruments the binary)

<a.out> (run the instrumented binary)

Default = html output
Example of discover
$ ./a.out
ERROR 1 (ABR): reading memory beyond array bounds at address 0xffbff278 (8 bytes) on the stack at:
    average() + 0x228 <disc.c:8>
         6:    for (int i=1; i<=len; i++)
         7:    {
         8:=>    total+=array[i];
         9:    }
    _start() + 0xd8
...
double array[20];
...
printf(" Average = %f\n", average(array,20) );
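The report points at the loop bounds: with len equal to 20, the loop runs i from 1 to 20, so it skips array[0] and reads array[20], one element past the end. A possible corrected version is sketched below. Only lines 6 to 9 of disc.c appear in the report, so the signature, the return statement, and the handling of an empty array are assumptions.

```c
#include <stddef.h>

/* Average of the first len elements; valid indices are 0 .. len-1. */
double average(const double *array, size_t len)
{
    if (len == 0)
        return 0.0; /* assumption: treat an empty array as averaging to zero */

    double total = 0.0;
    for (size_t i = 0; i < len; i++) /* was: i=1; i<=len */
    {
        total += array[i];
    }
    return total / (double)len;
}
```

Re-running the instrumented binary after a fix like this is a reasonable way to confirm that the ABR error no longer appears.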
Application Performance
Optimisation – the Basics
 No optimisation flags == no optimisation
 Good optimisation: -O
 Advanced optimisations:

Guided by profile of application

Knowledge of deployment systems
Profiling
 Profiling with the performance analyzer

collect <a.out>

collect -P <pid>

analyzer test.1.er
 Report generation with spot

spot <a.out>

spot -P <pid>
Performance Analyzer
 Demo
Aggressive Optimisation
 One stop flag: -fast
 Enables multiple optimisations

Build machine = deployment machine

Floating point simplification and optimisation

Pointers to different types do not alias

Function inlining
 Investigate performance gain
Profile Drives Flag Selection
Floating Point
 Significant time in floating point computation:

Floating point simplification

-fsimple=2
 Significant time in floating point library code:

Optimised floating point libraries

-xlibmopt, -xlibmil
 Use FP optimisations if performance improves and FP optimisations
are acceptable
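Whether FP optimisations are acceptable depends on how sensitive the results are to rounding. Aggressive floating point simplification lets the compiler rearrange arithmetic, and rearranged arithmetic can round differently. The sketch below uses reordered addition as a representative illustration; it is not a statement of exactly which transformations -fsimple=2 performs, and the function names are hypothetical.

```c
/* Sum evaluated left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same sum with the additions regrouped: a + (b + c). */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, sum_left gives 1.0 because the large values cancel first, whereas sum_right gives 0.0 because c is lost when it is added to -1e20. Comparing application output with and without the flag is the practical check.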
Profile Drives Flag Selection
Flat profile
 Many hot small functions

At least -xO4 optimisation level

-xipo for cross-file optimisations
 Conditional code or inlining

Profile feedback

-xprofile=collect:

Training run of application

-xprofile=use:
Profile Drives Flag Selection
Pointers
 Pointers inhibit compiler optimisations
 Compiler needs more information
 restrict qualified pointers in C

Localised action
 Flags:

-xrestrict (restrict qualified pointers passed into functions)

-xalias_level=std [C]

-xalias_level=compatible [C++]

Actions at file level
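As an illustration of the localised approach, the C99 restrict qualifier can be applied to individual pointer parameters. The function below is a hypothetical example, not code from the talk: restrict tells the compiler that dst and src never overlap, so it is free to load several elements of src before storing to dst.

```c
#include <stddef.h>

/* Write src[i] * factor into dst[i].  The restrict qualifiers promise the
   compiler that dst and src refer to non-overlapping memory. */
void scale(double *restrict dst, const double *restrict src,
           double factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        dst[i] = src[i] * factor;
    }
}
```

The promise is the caller's responsibility: passing overlapping arrays to a function declared this way is undefined behaviour. The file-level flags make a similar promise for every function in the file, so they need correspondingly more care.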
Processor Specific Optimisations
 Default: -xtarget=generic often good enough
 T4/T5 have useful instructions:

Compare and branch

Floating point multiply add
 One stop flag: -xtarget=T5
 Schedules for T5, uses entire T4 and T5 instruction set
 Only runs on T4, T5, or later processors
SPARC Instruction Sets
Multi-threaded Applications
Multi-thread or Multi-process
 Multiprocess (throughput):

Isolation

Independence

Large virtual memory footprint

Potentially high synchronisation costs
 Multithread (latency):

Low synchronisation costs

Minimal memory footprint
Multi-threaded Application Development
 POSIX threads (C11, C++11)

Low level: Great control, significant complexity
 OpenMP

High abstraction: Easy to use, flexible
 Automatic parallelisation

Trivial to use: -xautopar -xreduction

Works best for loop-intensive code (typically FP)
OpenMP Parallel For
 Distributes iterations across CPUs
#pragma omp parallel for
for (int i=0; i<length; i++)
{
  // Do work
}
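A complete function built around this pattern might look like the sketch below. The function name and the work done per iteration are hypothetical. The key property is that iterations are independent, so any thread can run any iteration.

```c
/* Multiply every element of data by factor, in place.  Each iteration
   touches only data[i], so iterations can safely run on different threads. */
void scale_in_place(double *data, int length, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < length; i++)
    {
        data[i] *= factor;
    }
}
```

The pragma only takes effect when OpenMP is enabled at compile time (with Oracle Solaris Studio that is expected to be -xopenmp; other compilers use their own flag). Without it the pragma is ignored and the loop runs serially with the same result.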
OpenMP Tasks
 Distributes work across CPUs
for (int i=0; i<length; i++)
{
  #pragma omp task
  {
    // Do work for task 'i'
  }
}
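The fragment above omits the enclosing parallel region. Tasks are normally created inside a parallel region, commonly by a single thread, while the other threads in the team pick them up and execute them. A minimal sketch of that structure follows, with a hypothetical function name and workload.

```c
/* Square each element of values, creating one task per element. */
void square_all(long *values, int length)
{
    #pragma omp parallel
    {
        /* One thread creates the tasks; the whole team executes them. */
        #pragma omp single
        {
            for (int i = 0; i < length; i++)
            {
                /* firstprivate(i): each task keeps its own copy of i. */
                #pragma omp task firstprivate(i)
                {
                    values[i] = values[i] * values[i];
                }
            }
        } /* implicit barrier: all tasks complete before the region ends */
    }
}
```

Tasks suit work that is not a simple counted loop, such as list traversal or recursion. For a plain loop like this one, parallel for is usually the simpler choice.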
Parallel Program Correctness
 Distributes work across CPUs
int total=0;
#pragma omp parallel for
for (int i=0; i<length; i++)
{
  total += i;
}
 Data race: Multiple threads updating the same variable
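One common way to remove this particular race is an OpenMP reduction clause, sketched below with a hypothetical function name. Each thread accumulates into its own private copy of total, and the copies are combined when the loop finishes, so no two threads write the same variable.

```c
/* Sum of 0 .. length-1.  reduction(+:total) gives every thread a private
   total and adds the private copies together at the end of the loop. */
int sum_indices(int length)
{
    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < length; i++)
    {
        total += i;
    }
    return total;
}
```

Protecting the update with a critical section or an atomic operation would also be correct, but it serialises every update, which limits scaling.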
Thread Analyzer
 Instrument application

Compiler flag: -xinstrument=datarace

Binary instrumentation: discover -i datarace <a.out>
 Gather data:

collect -r on <a.out>
 View data:

tha tha.1.er
Thread Analyzer - Example
 Demo
Scaling to Many Threads
 Minimise serial code

Amdahl’s Law
 Minimise lock contention
 Minimise writes of shared data
 Evenly distribute work
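Amdahl's Law quantifies why serial code matters: if a fraction p of the runtime can be parallelised across n threads, the best possible speedup is 1 / ((1 - p) + p / n). A small C sketch of the formula, with a hypothetical function name:

```c
/* Amdahl's Law: best-case speedup on `threads` threads when
   parallel_fraction of the single-threaded runtime can run in parallel. */
double amdahl_speedup(double parallel_fraction, int threads)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / threads);
}
```

For example, with 95% of the runtime parallel, the 128 hardware threads of one T5 chip (16 cores with 8 threads each) give a speedup of only about 17x, and no thread count can exceed 20x. The formula ignores lock contention, shared-data writes, and load imbalance, each of which lowers the achieved speedup further.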
Scaling to Many Threads
 Demo
Limits of Performance
 Threads

vmstat
 Instruction Issue Width

pgstat / cputrack / cpustat / ripc
 Bandwidth

busstat / bw
Conclusion: Optimising for T5
 Step 1: Profile and remove inefficient code
 Step 2: Explore benefits of increased optimisation
 Step 3: Identify opportunities for parallelisation
 Step 4: Profile and tune parallel code
 Step 5: Watch for hitting hardware limits