Copyright © 2012, Oracle and/or its affiliates. All rights reserved.

Maximizing Your SPARC T5 Oracle Solaris Application Performance
Darryl Gove
Senior Principal Software Engineer

Program Agenda
  - Hardware
  - Correctness
  - Performance
  - Parallelism

Oracle Solaris Studio: Compiler Suite, Analysis Suite
  - C, C++ Compilers: utilize advanced code generation technology to optimize apps for highest performance on SPARC & x86
  - Performance Analyzer: provides unparalleled insight into your app, allowing you to identify bottlenecks and improve performance by orders of magnitude
  - Fortran Compiler: optimizes compute-intensive app performance
  - Code Analyzer (new): ensures app reliability by detecting app vulnerabilities, including memory leaks and memory access violations
  - Debugger: ensures app stability with event handling & multi-thread support
  - Performance Library: maximizes compute-intensive app performance using advanced numeric solver libraries
  - Thread Analyzer: simplifies complex parallel programming errors by detecting hard to pinpoint race and deadlock conditions
  - Integrated Development Environment: increases developer efficiency

Oracle Solaris Studio 12.3 Highlights
  - Accelerate Performance
  - Gain Extreme Observability
  - Improve Productivity
  - 5x faster code on SPARC T5
  - 1.5x faster code on Intel x86
  - New Code Analyzer for more reliable applications; reports common coding & memory access errors faster than competitive alternatives
  - Enhanced Performance Analyzer with system-wide performance analysis
  - Remote access to Solaris Studio tools from local desktop (Oracle Solaris, Linux, Microsoft Windows, Mac)
  - Streamlined Oracle DB application development
  - Simplify Oracle Tuxedo development with IDE plug-in
  - IPS distribution on Solaris 11 for simplified management
  - 20% faster compile time

Oracle Solaris Studio 12.3, 1/13 PSE
  - Delivers compiler optimisations resulting in the fastest code on the new Oracle T5, Oracle M5 and Fujitsu M10 systems
      - Up to 5x faster than GCC
      - Up to 10% faster than Oracle Solaris Studio 12.3
  - IPS or SVR4 package update to Oracle Solaris Studio 12.3
  - Available for customers with the Solaris Development Tools Support contract
  - More information: Article ID 1519949.1 on My Oracle Support

SPARC T5 Hardware

SPARC T5 - Overview
  - Enhancement over T4
      - More threads
      - Faster clock speed
      - Larger third level cache
  - T5 and T4:
      - Unrelated to T1 – T3 (only share the T-series name)
      - Enhanced multithread throughput
      - Enhanced single thread performance

SPARC T5 - Details
  - 1 to 8 chips per system
  - 16 cores per chip
  - Dual issue
  - Out-of-order
  - 8 threads per core
  - 3.6 GHz clock
  - 115B (3.6 GHz * 16 * 2) instructions / sec / chip
SPARC T5 - Capacity
  - Chip capacity: 115 B instructions / sec
  - For fully active threads:
      - Single thread: 7.2 B instructions / sec
      - Each of eight threads: 0.9 B instructions / sec
  - Threads rarely fully active:
      - I/O wait
      - Processor stall (fetch from memory = 300-400 cycles)

Developing for T5
  - Make it correct
  - Remove obvious performance issues
  - Make it scale (correctly)

Application Correctness

Debug information
  - Always use -g
  - No optimisation flags:
      - Full debug
      - Lower performance
  - Optimised binaries:
      - Best effort debug
      - No/minimal performance impact
  - Debug what you ship!

Automatic Error Detection
  - Static/compile time error detection
      - Code Analyzer
  - Dynamic/runtime memory access error detection
      - Discover

Code Analyzer
  - Static analysis for common coding errors
      - Uninitialised variables, etc.
  - Compile with:
      -xanalyze=code
  - View results with:
      code-analyzer <a.out>

Code Analyzer – example output

Memory Error Detection - discover
  - Common memory allocation and use errors:
      - Uninitialised memory
      - Access past bounds
      - Memory leaks
  - Usage:
      discover <a.out>
      <a.out>
  - Default = html output

Example of discover

    $ ./a.out
    ERROR 1 (ABR): reading memory beyond array bounds at address 0xffbff278 (8 bytes) on the stack at:
        average() + 0x228 <disc.c:8>
            6:     for (int i=1; i<=len; i++)
            7:     {
            8:=>     total+=array[i];
            9:     }
        _start() + 0xd8

    ...
    double array[20];
    ...
    printf(" Average = %f\n", average(array,20) );

Application Performance

Optimisation – the Basics
  - No optimisation flags == no optimisation
  - Good optimisation: -O
  - Advanced optimisations:
      - Guided by profile of application
      - Knowledge of deployment systems

Profiling
  - Profiling with the performance analyzer
      collect <a.out>
      collect -P <pid>
      analyzer test.1.er
  - Report generation with spot
      spot <a.out>
      spot -P <pid>

Performance Analyzer Demo

Aggressive Optimisation
  - One stop flag: -fast
  - Enables multiple optimisations
      - Build machine = deployment machine
      - Floating point simplification and optimisation
      - Pointers to different types do not alias
      - Function inlining
  - Investigate performance gain

Profile Drives Flag Selection - Floating Point
  - Significant time in floating point computation:
      - Floating point simplification -fsimple=2
  - Significant time in floating point library code:
      - Optimised floating point libraries -xlibmopt, -xlibmil
  - Use FP optimisations if performance improves and FP optimisations are acceptable

Profile Drives Flag Selection - Flat profile
  - Many hot small functions
      - At least -xO4 optimisation level
      - -xipo for cross-file optimisations
  - Conditional code or inlining
      - Profile feedback
      - -xprofile=collect:
      - Training run of application
      - -xprofile=use:
Profile Drives Flag Selection - Pointers
  - Pointers inhibit compiler optimisations
      - Compiler needs more information
  - restrict qualified pointers in C
      - Localised action
  - Flags:
      - -xrestrict (restrict qualified pointers passed into functions)
      - -xalias_level=std [C]
      - -xalias_level=compatible [C++]
      - Actions at file level

Processor Specific Optimisations
  - Default: -xtarget=generic often good enough
  - T4/T5 have useful instructions:
      - Compare and branch
      - Floating point multiply add
  - One stop flag: -xtarget=T5
      - Schedules for T5, uses entire T4 and T5 instruction set
      - Only runs on T4, T5, or later processors

SPARC Instruction Sets

Multi-threaded Applications

Multi-thread or Multi-process
  - Multiprocess:
      - Isolation
      - Independence
      - Large virtual memory footprint
      - Potentially high synchronisation costs
      - Throughput
  - Multithread
      - Low synchronisation costs
      - Minimal memory footprint
      - Latency

Multi-threaded Application Development
  - POSIX threads (C11, C++11)
      - Low level: Great control, significant complexity
  - OpenMP
      - High abstraction: Easy to use, flexible
  - Automatic parallelisation
      - Trivial to use: -xautopar -xreduction
      - Works best for loop-intensive code (typically FP)

OpenMP Parallel For
  - Distributes iterations across CPUs

    #pragma omp parallel for
    for (int i=0; i<length; i++)
    {
      // Do work
    }
OpenMP Tasks
  - Distributes work across CPUs

    for (int i=0; i<length; i++)
    {
      #pragma omp task
      {
        // Do work for task 'i'
      }
    }

Parallel Program Correctness
  - Distributes work across CPUs

    int total=0;
    #pragma omp parallel for
    for (int i=0; i<length; i++)
    {
      total += i;
    }

  - Data race: Multiple threads updating the same variable

Thread Analyzer
  - Instrument application
      - Compiler flag: -xinstrument=datarace
      - Binary instrumentation: discover -i datarace <a.out>
  - Gather data:
      collect -r on <a.out>
  - View data:
      tha tha.1.er

Thread Analyzer - Example Demo

Scaling to Many Threads
  - Minimise serial code
      - Amdahl's Law
  - Minimise lock contention
  - Minimise writes of shared data
  - Evenly distribute work

Scaling to Many Threads Demo

Limits of Performance
  - Threads
      - vmstat
  - Instruction Issue Width
      - pgstat / cputrack / cpustat / ripc
  - Bandwidth
      - busstat / bw

Conclusion: Optimising for T5
  - Step 1: Profile and remove inefficient code
  - Step 2: Explore benefits of increased optimisation
  - Step 3: Identify opportunities for parallelisation
  - Step 4: Profile and tune parallel code
  - Step 5: Watch for hitting hardware limits