Three Questions Everyone Keeps Asking
Stephen Blair-Chappell
Intel Compiler Labs

8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Three Common Requests
• “How can I make my program run faster?”
• “How can I make my program parallel?”
• “Will my code run on any CPU?” (compatibility)

Intel® Parallel Studio XE: Three Components
• Intel® Composer XE (compiler, libraries): use to generate fast, safe, parallel code (C/C++, Fortran).
• Intel® VTune™ Amplifier XE (profiler): find hotspots and bottlenecks in your code.
• Intel® Inspector XE (memory errors, parallel errors): use to find memory and threading errors.

Intel® Parallel Studio XE: Four Components (Composer XE + Advisor)
• Intel® Parallel Advisor: use to model parallelism in your existing applications.
• Intel® Composer XE (compiler, libraries): use to generate fast, safe, parallel code (C/C++, Fortran).
• Intel® VTune™ Amplifier XE (profiler): find hotspots and bottlenecks in your code.
• Intel® Inspector XE (memory errors, parallel errors): use to find memory and threading errors.
Faster Code: The compiler uses many optimisation techniques
• fast floating point
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf

Faster Code
Often we are happy with the out-of-the-box experience. When was the last time you looked at some documentation?

Faster Code: The Seven Optimisation Steps
Example options are given as Windows (Linux).
Step 1: Build with optimization disabled: /Od (-O0)
Step 2: Use general optimizations: /O1, /O2, /O3 (-O1, -O2, -O3)
Step 3: Use processor-specific options: /QxSSE4.2, /QxHOST (-xsse4.2, -xhost)
Step 4: Add inter-procedural optimization: /Qipo (-ipo)
Step 5: Use profile-guided optimization: /Qprof-gen, /Qprof-use (-prof-gen, -prof-use)
Step 6: Tune automatic vectorization: /Qguide (-guide)
Step 7: Implement parallelism or use automatic parallelism: use the Intel family of parallel models, or /Qparallel (-parallel)

Faster Code: Vectorisation is …
[Figure: timeline of SIMD instruction set extensions]
• 1999 SSE (70 instr): single-precision vectors, streaming operations
• 2000 SSE2 (144 instr): double-precision vectors, 8/16/32/64/128-bit vector integer
• 2004 SSE3 (13 instr): complex data
• 2006 SSSE3 (32 instr): decode
• 2007 SSE4.1 (47 instr): video, graphics building blocks, advanced vector instr
• 2008 SSE4.2 (8 instr): string/XML processing, POP-count, CRC
• 2009 AES-NI (7 instr): encryption and decryption, key generation
• 2011 AVX (~100 new instr, ~300 legacy SSE instr updated): 256-bit vector, 3- and 4-operand instructions
• 2012\2013 AVX2: integer AVX expands to 256 bit, improved bit manipulation, FMA, vector shifts, gather
• 2012 MIC: 512-bit vector
[Figure: the loop for (i=0;i<MAX;i++) c[i]=a[i]+b[i]; executed as one vector operation: a[3] a[2] a[1] a[0] + b[3] b[2] b[1] b[0] = c[3] c[2] c[1] c[0]]

Faster Code: Different Ways of Inserting Vectorised Code
Listed from greatest ease of use (top) to greatest programmer control (bottom):
• Use performance libraries (e.g. IPP and MKL)
• Compiler: fully automatic vectorization
• Cilk Plus array notation
• Compiler: auto-vectorization hints (#pragma ivdep, …)
• User-mandated vectorization (SIMD directive)
• Manual CPU dispatch (__declspec(cpu_dispatch …))
• SIMD intrinsic class (F32vec4 add)
• Vector intrinsic (_mm_add_ps())
• Assembler code (addps)

Faster Code: An example
[Figure: speedup by upgrading silicon; speedup by swapping compiler; verified using VTune]

Parallel Code: Speedup using parallelism
Four Step Development: Analyze, Implement, Debug, Tune
1. Analyze: Amplifier XE (hotspot, EBS (XE only))
2. Implement: Composer XE (compiler; libraries: IPP, MKL, TBB; OpenMP; Cilk Plus)
3. Debug: Inspector XE (memory, threads)
4. Tune: Amplifier XE (concurrency, locks & waits)

Parallel Code (Analyze): Four Different Ways to Find the Hotspots
1. Using the Intel compiler’s loop profiler & profile viewer
2. Using the compiler’s Auto-parallelizer
3. Using Amplifier XE
4. Performing a Survey with Advisor

Parallel Code (Implement): Language to help parallelism
• Intel® Cilk™ Plus
• OpenMP
• Intel® Threading Building Blocks
• Intel® MPI
• Fortran Coarrays
• OpenCL
• Native Threads

OpenMP:

    #pragma omp parallel for
    for (i = 1; i <= 4; i++) {
        printf("Iter: %d", i);
    }

Cilk Plus:

    cilk_for (int i = 0; i < max_row; i++) {
        for (int j = 0; j < max_col; j++) {
            p[i][j] = mandel(complex(scale(i), scale(j)));
        }
    }

Parallel Code (Debug): Four Different Ways to Find your Parallel Errors
1. Using Inspector XE
2. Perform a Static Security Analysis
3. Debug with Advisor
4. Use Parallel Debug Extensions

Parallel Code: An example …
1. Hotspot Analysis
2. Implement
3. Find Threading Errors
4, 5, 6. Tune Parallelism
https://makebettercode.com/parallel_landing_required/lib/pdf/5373_IN_ParallelMag_Sudoku_060911.pdf

Compatible Code: Will my program run on any CPU?
• Compatibility: run?
• Future proofing: build?
• Scalability: performance?
Considerations: OS-agnostic, CPU-agnostic, language / standards, tools
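As a hedged illustration of the "vector intrinsic" entry in the list of ways to insert vectorised code, and of the CPU-agnostic concern raised on the compatibility slide, the sketch below adds two float arrays four lanes at a time with SSE intrinsics when the compiler advertises SSE, and otherwise falls back to the plain loop from the vectorisation slide. The names add_arrays and add_vectors and the use of the __SSE__ guard are assumptions made for this example (GCC and Clang define that macro; other compilers may need a different check). It is not code from the hands-on lab.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    // Vector path: one packed add handles four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned loads work for any pointer
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar path: the remainder, or the whole array when SSE is not available.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical convenience wrapper for std::vector inputs of equal length.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    add_arrays(a.data(), b.data(), c.data(), a.size());
    return c;
}
```

Keeping the scalar loop as the fallback means the same source builds on targets without SSE, at the cost of maintaining two paths by hand; the options higher up the list (libraries, auto-vectorization) leave that choice to the tools.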
Vectorised / Parallel
[Graphs: results for the vectorised and parallel versions. On the graphs, bigger is better.]

Running Example: Monte Carlo

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++) {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0f, val2 = 0.0f;

        #pragma simd reduction(+:val) reduction(+:val2)
        for (int pos = 0; pos < RAND_N; pos++) {
            float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
            val += callValue;
            val2 += callValue * callValue;
        }

        float exprt = expf(-RISKFREE * T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrtf(((float)RAND_N * val2 - val * val) / ((float)RAND_N * (float)(RAND_N - 1.f)));
        h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
    }

SFTL003 hands-on lab

Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications
More Cores. Wider Vectors. Performance Delivered.

More Cores: multicore, many-core (50+ cores)
Wider Vectors: 128 bits, 256 bits, 512 bits
Scaling Performance Efficiently: serial performance, task & data parallel performance, distributed performance
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
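The tail of the running Monte Carlo example turns the two reductions (val and val2) into a discounted mean and a 95% confidence half-width. A minimal, self-contained sketch of that arithmetic follows. It assumes the per-path payoffs are already available in a vector; the names Estimate and summarize are hypothetical; double precision is used instead of the slide's float. It only illustrates the formula and is not the lab's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct Estimate {
    double mean;        // discounted average payoff
    double confidence;  // 95% confidence half-width
};

// Hypothetical helper mirroring the statistics at the end of the slide's loop body.
// 'discount' plays the role of exprt = exp(-RISKFREE * T) on the slide.
Estimate summarize(const std::vector<double>& payoffs, double discount) {
    assert(payoffs.size() > 1);  // the variance formula divides by (n - 1)
    const double n = static_cast<double>(payoffs.size());

    double val = 0.0;   // running sum of payoffs
    double val2 = 0.0;  // running sum of squared payoffs
    for (double p : payoffs) {  // the loop the slide marks with a simd reduction
        val += p;
        val2 += p * p;
    }

    // Sample standard deviation from the two sums, as on the slide.
    const double std_dev = std::sqrt((n * val2 - val * val) / (n * (n - 1.0)));
    return {discount * val / n, discount * 1.96 * std_dev / std::sqrt(n)};
}
```

Because both sums are plain additions with no dependence between iterations other than the accumulation itself, they are the kind of loop that a reduction clause can vectorise or parallelise.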
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
Efficiently Produce Fast, Scalable and Reliable Applications

Phase: Build
• Intel® Advisor XE
  Feature: Threading design assistant (Studio products only)
  Benefit: Simplifies, demystifies, and speeds parallel application design
• Intel® Composer XE
  Feature: C/C++ and Fortran compilers, Intel® Threading Building Blocks, Intel® Cilk™ Plus, Intel® Integrated Performance Primitives, Intel® Math Kernel Library
  Benefit: Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core
• Intel® MPI Library†
  Feature: High Performance Message Passing (MPI) Library
  Benefit: Enabling high performance scalability, interconnect independence, runtime fabric selection, and application tuning capability

Phase: Verify & Tune
• Intel® VTune™ Amplifier XE
  Feature: Performance profiler for optimizing application performance and scalability
  Benefit: Removes guesswork, saves time, and makes it easier to find performance and scalability bottlenecks
• Intel® Inspector XE
  Feature: Memory & threading dynamic analysis for code quality; static analysis for code quality
  Benefit: Increases productivity and code quality, lowers cost, and finds memory, threading, and security defects before they happen
• Intel® Trace Analyzer & Collector†
  Feature: MPI performance profiler for understanding application correctness & behavior
  Benefit: Analyzes performance of MPI programs and visualizes parallel application behavior and communication patterns to identify hotspots
Top New Features
Efficiently produce fast, scalable and reliable applications running on Windows* and Linux*

Performance
• Improved compiler and library performance
• + Ivy Bridge microarchitecture
• + Haswell microarchitecture
• + Intel® Xeon Phi™ coprocessor
Performance Profiling
• A dozen new analysis features
• Low overhead Java* profiling
• CPU Power Analysis
Reliability
• Pointer checker
• Heap growth analysis
• Improved MPI fault tolerance†
Reproducibility
• Conditional numerical reproducibility
Parallelism Assistance
• Analysis extended to include Linux*, Fortran and C# (in addition to Windows* and C/C++)
Standards
• Expanded C++ 11
• Expanded Fortran 2008
• MPI 2.2†
† Intel® Cluster Studio XE

The Build Environment (Sudoku solver and generator)
Target 13-0: tool MS Compiler, 1st version, macro STEP_0
Target 13-1: tool ICC, Find Hotspot, macro STEP_1
Target 13-2: tool ICC, Add SSE Intrinsics, macro STEP_2
Target 13-3: tool VTune, Find Hotspot, macro STEP_3
Target 13-4: tool ICC, Add OpenMP Code, macro STEP_4
Target 13-5: tool Inspector, Check Correctness, macro STEP_5
Target 13-6: tool ICC, Fix Correctness, macro STEP_6
Target 13-7: tool VTune, Tune Parallelism, macro STEP_7
Target 13-8: tool ICC, Finish, macro STEP_8
Key (build modes): Serial Release Mode, OpenMP Debug Mode, OpenMP Release Mode
Build example: make 13-0 or nmake 13-0

How to Run
13-0.exe test.txt

Your Challenge: the hands-on
Examine each of the eight stages and use a combination of the compiler, Inspector, and Amplifier to understand what is going on.
Answer these questions:
• Is the application using the CPU at its best? (Steps 0, 2 and 8)
• What’s the biggest hotspot in the serial code? (Steps 1 and 3)
• What errors were introduced into the parallelism? (Steps 4, 5 & 6)
• How well is the parallelism tuned? (Steps 7 & 8)
Supplement: Why is the Linux version slower than the Windows version?

Thank You

Backup
Intel® Parallel Studio XE, Intel® Cluster Studio XE (30 minutes)

What’s New? Intel® Parallel Studio XE 2013 / Intel® Cluster Studio XE 2013
Performance Leadership:
• 3rd Generation Intel® Core™ Processors (code name “Ivy Bridge”) and future Intel® processors (code name “Haswell”)
• Intel® Xeon Phi™ coprocessors
• Improved C++ and Fortran performance
New Product Capabilities:
• Latest OS: Windows* 8 Desktop, Linux*
• IDE: Visual Studio 2008, 2010, 2012 and GNU tool chain
• Standards: C99, selected C++11 features, almost complete Fortran 2003 support, selected features from Fortran 2008, MPI 2.2
Boost Performance

Support for Latest Intel Processors and Coprocessors
Columns: Intel® Ivy Bridge microarchitecture / Intel® Haswell microarchitecture / Intel® Xeon Phi™ coprocessor
• Intel® C++ and Fortran Compiler: ✔ AVX / ✔ AVX2, FMA3 / ✔ IMCI
• Intel® TBB library: ✔ / ✔ / ✔
• Intel® MKL library: ✔ AVX / ✔ AVX2, FMA3 / ✔
• Intel® MPI library: ✔ / ✔ / ✔
• Intel® VTune™ Amplifier XE†: ✔ Hardware Events / ✔ Hardware Events / ✔ Hardware Events
• Intel® Inspector XE: ✔ Memory & Thread Checks / ✔ Memory & Thread / ✔ Memory & Thread††
† Hardware events for new processors added as new processors ship.
†† Analysis runs on multicore processors, provides analysis for multicore and many-core processors.

Performance-Oriented Compiler Suites
Intel® Compilers, Performance Libraries, Debugging Tools on Windows, Linux and Mac OS X

Intel® C++ Composer XE 2013
• Intel® C++ Compiler XE 13.0 with Intel® Cilk™ Plus
• Intel® TBB
• Intel® MKL
• Intel® IPP
• Intel® Xeon Phi™ product family support, Linux

Intel® Fortran Composer XE 2013
• Intel® Fortran Compiler XE 13.0
• Intel® MKL
• Compatibility with Compaq Visual Fortran*
• Fortran 2003, 2008 support
• Intel® Xeon Phi™ product family support, Linux

Intel Composer XE 2013
• Combines Intel C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows (requires Visual Studio) and Linux only

Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio*
Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT
Mac OS X: Intel C++/gcc compatibility & integration into XCode* Environment
All: Intel Fortran performance leadership, compatible with Compaq* Visual* Fortran
All: Leadership performance on Intel and compatible architectures
All: One Year Intel® Premier Support. Renewable Annually.
Performance. Compatibility. Support.

Superior C++ Compiler Performance
More Performance
• Just recompile
• Uses Intel® AVX and Intel® AVX2 instructions
• Intel® Xeon Phi™ product family support, Linux: compiler, debugger (Linux)
• Intel® Cilk™ Plus: tasking and vectorization

Superior Fortran Compiler Performance
More Performance
• Just recompile
• Intel® Xeon Phi™ product family: Linux compiler, debugger support
• Access to Intel® AVX and Intel® AVX2 instructions (-xa or /Qxa)
• Auto-parallelizer & directives to access SIMD instructions
• Coarrays & synchronization constructs support parallel programming
• Loop optimization directives: VECTOR, PARALLEL, SIMD
• More control over array data alignment (align arrayNbytes)

C++ Performance Guide
Performance Wizard for Windows
• Quick 5 step process for more performance
• Get help choosing optimization options
Gain Performance with Less Effort

Intel® Math Kernel Library (MKL)
• Highly optimized threaded math routines
• Applications in science, engineering, finance
• Use Intel® MKL on Windows*, Linux*, Mac OS*
• Use Intel® MKL with Intel compiler, gcc, MSFT*, PGI
• Component of Intel® Parallel Studio XE and Intel® Cluster Studio XE
33% of math library users rely on Intel’s Math Kernel Library (EDC North America Development Survey 2011, Volume II)
Drop In The Next Intel® MKL Version to Unlock New Processor Performance
LAPACK Performance Improves with Intel® Math Kernel Library
[Figure: LAPACK performance chart]

Intel® Integrated Performance Primitives (IPP)
A Library Of Highly Optimized Algorithmic Building Blocks For Media And Data Applications
Optimized for Performance and Power Efficiency
• Highly optimized using SSE, AVX instruction sets
• Performance beyond what an optimized compiler produces alone
Intel Engineered & Future Proofed to Save You Time
• Ready-to-use & royalty free
• Fully optimized for current and past processors
• Save development, debug, and maintenance time
• Code once now, receive future optimizations later
Wide range of Cross Platform & OS Functionality
• Thousands of optimized functions
• Supports Windows*, Linux*, and Mac OS* X
• Supports Intel® Atom, Intel® Core, Intel® Xeon platforms
Availability: Part of several different product packages with single and multi-user licenses; volume, academic, and student discounts are available.
Try it Before You Buy It: Download a trial version today at intel.com/software/products/eval
Performance Building Blocks to Make Your Applications Faster, Faster

Intel® IPP Boost from Intel® AVX
[Figure: Intel® IPP performance chart]

Intel® VTune™ Amplifier XE Performance Profiler
Where is my application…
Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source
Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses
Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait
• Windows & Linux
• Low overhead
• No special recompiles
“We improved the performance of the latest run 3 fold. We wouldn't have found the problem without something like Intel® VTune™ Amplifier XE.” Claire Cates, Principal Developer, SAS Institute Inc.
Advanced Profiling for Scalable Multicore Performance

A Dozen New Analysis Features: Intel® VTune™ Amplifier XE 2013
More Profiling Data
1) Statistical Call Counts: data for inlining & parallelization
2) Hardware Events + Stacks: lower overhead, higher resolution; finds hot spots in small functions
3) Uncore Event Counting: more accurate bandwidth analysis
4) Ivy Bridge Events
5) Haswell Events: more/better advanced profiles (e.g., bandwidth); updates as new processors ship
6) Intel® Xeon Phi™ Products: hardware events
Easier To Use
7) Source View for Inlined Code (for Intel® and GCC* compilers)
8) Java Tuning: results map to the Java source
9) Task Annotation API: label and visualize tasks
10) User Defined Metrics: create meaningful metrics from events
11) Programmable Hot Keys: start and stop collection easily
12)
Easy to Use, Wealth of Data, Powerful Analysis

Low Overhead Java* Profiling: Intel® VTune™ Amplifier XE 2013
Low Overhead & Precise
• Sampling is fast / unobtrusive
• Hardware sampling even faster (now with optional stacks!)
• Advanced profiles are unique (cache misses, bandwidth…)
Versatile & Easy to Use
• Multiple simultaneous JVMs
• Mixed Java / C++ / Fortran
• See results on the Java source
Better Data, Lower Overhead, Easier to Use
CPU Power Analysis: Intel® VTune™ Amplifier XE 2013
To decrease CPU power usage, minimize wake-ups
• Identify wake-up causes
  – Timers triggered by application
  – Interrupts mapped to HW intr level
  – Show wake-up rate
• Display source code for events that wake up the processor
• Show CPU frequencies by CPU core (CPU frequencies can change by CPU activity level)
• Linux only
Select & filter to see a single wake-up object.
Uniquely Identifies the Cause of Wake-ups and Gives Timer Call Stacks

Scale Forward

Simplify and Speed Threading Design
Intel® Advisor XE: Threading Assistant
The Challenge of Parallel Design:
• Need to implement to measure performance
• Implementation is time consuming
• Disrupts regular product development
• Testing difficult without tools
Intel Advisor XE Separates Design & Implementation
• Fast exploration of multiple options
• Find errors before implementation
• Design without disrupting development
New! Linux* and Windows*
New! C, C++, Fortran and C# code
Add Parallelism with Less Effort, Less Risk and More Impact

Design Then Implement
Intel® Advisor XE 2013: Threading Assistant
Design Parallelism
• No disruption to regular development
• All test cases continue to work
• Tune and debug the design before you implement it
1) Analyze it.
2) Design it. (Compiler ignores these annotations.)
3) Tune it.
4) Check it.
Implement Parallelism
5) Do it!
Less Effort, Less Risk, More Impact
Scale Forward with Intel Parallel Models
Extend to Intel® Xeon Phi™ Coprocessors
Abstract, Scalable and Composable
• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism
• Intel® Threading Building Blocks: widely used C++ template library for thread management
Support Standards
• OpenMP
• Coarray Fortran
• MPI
Open programming models and also Intel products
Targets: Intel® Xeon processors and compatible processors; Intel® Xeon Phi™ product family
Don’t Leave Your Code Behind

Simplify Parallelism
Intel® Cilk™ Plus, Intel® Threading Building Blocks

Intel® Cilk™ Plus
What: Language extensions to simplify task/data parallelism
Features:
• 3 simple keywords & array notations for parallelism
• Support for task and data parallelism
• Semantics similar to serial code
Why:
• Simple way to parallelize your code
• Sequentially consistent, low overhead, powerful solution
• Supports C, C++, Windows and Linux

Intel® Threading Building Blocks
What: Widely used C++ template library for task parallelism
Features:
• Parallel algorithms and data structures
• Scalable memory allocation and task scheduling
• Synchronization primitives
Why:
• Rich feature set for general purpose parallelism
• Available as open source or commercial license
• Supports C++, Windows, Linux, Mac OS X, other OSs

Task and Data Parallelism Made Easier
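The task-parallel style described above (spawn a piece of work, then wait for it) can be approximated with OpenMP tasks, which appear on the slides as one of the supported standards. The sketch below is an OpenMP stand-in for that fork-join pattern, not Cilk Plus or TBB code. The function names fib and parallel_fib, the cutoff value 20, and the _OPENMP guards are assumptions made for this example. When the compiler is not run in OpenMP mode the pragmas are skipped and the code runs serially with the same result.

```cpp
#include <cassert>

// Hypothetical fork-join example: naive Fibonacci with one spawned child per call.
long fib(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    long x = 0;
    long y = 0;
    // Spawn: compute fib(n - 1) as a task; small subproblems run inline (if clause).
#if defined(_OPENMP)
#pragma omp task shared(x) if (n > 20)
#endif
    x = fib(n - 1);
    // The spawning thread continues with the other branch.
    y = fib(n - 2);
    // Sync: wait for the spawned child before combining results.
#if defined(_OPENMP)
#pragma omp taskwait
#endif
    return x + y;
}

// Hypothetical entry point: one thread starts the recursion, the team runs the tasks.
long parallel_fib(int n) {
    long result = 0;
#if defined(_OPENMP)
#pragma omp parallel
#pragma omp single
#endif
    result = fib(n);
    return result;
}
```

The cutoff keeps very small subproblems from paying task-creation overhead; a suitable threshold depends on the machine and is something a profiler run would inform.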
Parallelize Applications For Performance
Intel® Threading Building Blocks (TBB)
A popular, proven parallel C++ abstraction
A C++ template library
• Scalable memory allocation
• Load-balancing
• Work-stealing task scheduling
• Thread-safe pipeline
• Flexible flow graph
• Concurrent containers
• High-level parallel algorithms
• Numerous synchronization primitives
• Open source, and portable across many OSs
"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table" Michaël Rouillé, CTO, Golaem
Simplify Parallelism with a Scalable Parallel Model

Scale Forward and Extend to Intel® Xeon Phi™ Coprocessors
Intel® Cilk™ Plus (Language Extension to C/C++)
Easier Task & Data Parallelism
• 3 simple keywords: cilk_for, cilk_spawn, cilk_sync
Intel® Cilk™ Plus Array Notation
• Save time with powerful vectorization
Minimize Software Re-Work for New Hardware

Increase Reliability

Pointer Checker
• Finds buffer overflows and dangling pointers before memory corruption occurs
• Powerful error reporting
• Integrates into standard debuggers (Microsoft, gdb, Intel)

Dangling pointer (deliberately faulty example):

    {
        char *p, *q;
        p = malloc(10);
        q = p;
        free(p);
        *q = 0;
    }

Buffer overflow (deliberately faulty example):

    {
        char *my_chp = "abc";
        char *an_chp = (char *) malloc (strlen((char *)my_chp));
        memset (an_chp, '@', sizeof(my_chp));
    }

Pointer Checker report:

    CHKP: Bounds check error
    Traceback:
    ./a.out(main+0x1b2) [0x402d7a] in file mems.c at line 13

Pointer Checker Highlights Programming Errors For More Secure Applications
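In the buffer-overflow snippet on the Pointer Checker slide, the allocation holds strlen("abc") = 3 bytes, while memset writes sizeof(my_chp) bytes, which is the size of a pointer (commonly 4 or 8), so the write runs past the end of the block. In the dangling-pointer snippet, q is used after the memory it aliases has been freed. A minimal sketch of a corrected version follows; the function name filled_copy and its interface are hypothetical and chosen only so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: returns a string of the same length as src,
// with every character replaced by 'fill'.
std::string filled_copy(const char* src, char fill) {
    assert(src != nullptr);
    const std::size_t len = std::strlen(src);

    // Allocate room for the characters plus the terminating '\0'.
    char* buf = static_cast<char*>(std::malloc(len + 1));
    assert(buf != nullptr);

    // Write exactly 'len' bytes: the string length, not the size of a pointer.
    std::memset(buf, fill, len);
    buf[len] = '\0';

    std::string out(buf);  // copy the data out before releasing the buffer

    std::free(buf);
    buf = nullptr;  // no alias of the freed block is kept or dereferenced
    return out;
}
```

Copying the result into a std::string before calling free avoids the second error on the slide, because nothing refers to the block once it has been released.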
Conditional Numerical Reproducibility
Intel® Math Kernel Library:
• New deterministic task scheduling and code path selection options
OpenMP*:
• New deterministic reduction option
Intel® Threading Building Blocks:
• New parallel deterministic reduce option
“I’m a C++ and Fortran developer and have high praise for the Intel® Math Kernel Library. One nice feature I’d like to stress is the numerical reproducibility of MKL which helps me get the assurance I need that I’m getting the same floating point results from run to run.” Franz Bernasek, Owner / CEO, Senior Developer, MSTC Modern Software Technology
Help Achieve Reproducible Results, Despite Non-associative Floating Point Math

Expanded C++ 11 support
• Additional type traits
• Initializer lists (partial)
• Generalized constant expressions (partial)
• Noexcept (partial)
• Range-based for loops
• Conversions of lambdas to function pointers
Excellent Support for C++ 11 on Windows* and Linux*
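Several of the C++11 features listed above fit in a few lines. The sketch below is plain standard C++11 written for illustration, with hypothetical function names (square, apply, sum_of_squares, triple_via_lambda). It makes no claim about which parts of these features a particular compiler release supports.

```cpp
#include <cassert>
#include <initializer_list>

// Generalized constant expression plus noexcept: usable at compile time.
constexpr int square(int x) noexcept { return x * x; }
static_assert(square(4) == 16, "square is evaluated at compile time");

// Takes an ordinary function pointer.
int apply(int (*fn)(int), int value) { return fn(value); }

// Initializer list parameter consumed with a range-based for loop.
int sum_of_squares(std::initializer_list<int> values) {
    int total = 0;
    for (int v : values) {
        total += square(v);
    }
    return total;
}

// A lambda without captures converts implicitly to a function pointer.
int triple_via_lambda(int value) {
    return apply([](int x) { return 3 * x; }, value);
}
```

Only captureless lambdas convert to function pointers; one that captures local state would need a template parameter or std::function instead.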
Expanded Fortran 2008 Support
• Maximum array rank has been raised to 31 dimensions (Fortran 2008 specifies 15)
• Recursive type may have ALLOCATABLE components
• Coarrays
  – CODIMENSION attribute
  – SYNC ALL statement
  – SYNC IMAGES statement
  – SYNC MEMORY statement
  – CRITICAL and END CRITICAL statements
  – LOCK and UNLOCK statements
  – ERROR STOP statement
  – ALLOCATE and DEALLOCATE may specify coarrays
  – Intrinsic procedures IMAGE_INDEX, LCOBOUND, NUM_IMAGES, THIS_IMAGE, UCOBOUND
• CONTIGUOUS attribute
• MOLD keyword in ALLOCATE
• DO CONCURRENT
• NEWUNIT keyword in OPEN
• G0 and G0.d format edit descriptor
• Unlimited format item repeat count specifier
• CONTAINS section may be empty
• Intrinsic procedures BESSEL_J0, BESSEL_J1, BESSEL_JN, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZ
• Additions to intrinsic module ISO_FORTRAN_ENV: ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16, INT32, INT64, LOCK_TYPE, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED
• New: ATOMIC_DEFINE and ATOMIC_REF, initialization of polymorphic INTENT(OUT) dummy arguments, standard handling of G format and of printing the value zero, coarrays (more support), polymorphic source allocation
Leadership F2008 Support on Linux*, Windows* & OSX*
Dynamic Analysis Finds Memory & Threading Errors
Intel® Inspector XE 2013
Find and eliminate errors
• Memory leaks, invalid access…
• Races & deadlocks
• Analyze hybrid MPI cluster apps
• Heap growth analysis
Faster & Easier to use
• Debugger breakpoints
  – Break on selected errors
  – Run faster to known error
• Pause/resume collection
  – Narrow analysis focus
  – Better performance
• Improved error suppression
Find Errors Early When They are Less Expensive

Heap Growth Analysis
Intel® Inspector XE 2013
Does Application Memory Usage Mysteriously Grow?
• Set an analysis interval with start and end points
  – Click a button, or
  – Use an API
• See a list of memory allocations that are not freed in the interval
• Quickly zero in on suspicious activity that contributes to heap growth
Speeds Diagnosis of Difficult to Find Heap Errors

Static Analysis Finds Coding and Security Errors
Intel® Parallel Studio XE 2013
Find over 250 error types, e.g.:
• Incorrect directives
• Security errors
Easier to use
• Choose your priority:
  – Minimize false errors
  – Maximize error detection
• Hierarchical navigation of results
• Share comments with the team
Increased Accuracy & Speed
• Detect errors without all source files
• Better scaling with large code bases
Code Complexity Metrics
• Find code likely to be less reliable
Find Errors and Harden your Security
Static Analysis is only available in Studio XE bundles. It is not sold separately.
Cluster Tools
Intel® Parallel Studio XE and Intel® Cluster Studio XE: Compilers & Libraries, Cluster Tools

Scale Forward, Scale Faster
Intel Cluster Tools

Scale Performance – Perform on More Nodes
• MPI Latency – Intel® MPI Library – Up to 6.5X as fast as alternative MPI libraries
• Compiler Performance – Industry leading Intel® C/C++ & Fortran compilers

Scale Forward – multicore now, many-core ready
• Intel® MPI Library scales beyond 120k processes
• Focused to preserve programming investments for multicore and many-core machines

Scale Efficiently – Tune & Debug on More Nodes
• Thread & Memory Correctness Checking – Intel® Inspector XE now MPI enabled across many nodes
• Rapid Node Level Performance Profiling – Intel® VTune™ Amplifier XE can identify hotspots faster and on thousands of nodes

High Performance, Standards Driven, Fabric Flexible MPI Library

On the Path to Exascale
Intel® MPI Library (part of Cluster Studio XE 2013)

Increased Scaling
• 120k Processes

Standards Support
• MPI 2.2

Latest hardware support
• Ivy Bridge and Haswell
• Intel® Xeon Phi™ Coprocessor

[Chart: process scaling by year, 2010 to 2018, in K processes. Series: Intel® MPI Library; Doubling; Exascale (estimated). Labelled Intel® MPI Library points: 60K, 90K and 120K processes.]

Continued Scaling Capacity to Meet Ever Growing HPC Demands
Improved MPI Fault Tolerance
Intel® MPI Library

Checkpointing
• Implementation of Berkeley Lab Checkpoint/Restart (BLCR) †

Primary Uses
• Scheduling
• Process Migration
• Failure Recovery

[Diagram: Fault Recovery Scenario. Stages: Node Fault, Checkpoint, Recovery.]

Enabling Capabilities for Robust at Scale MPI Computing

MPI 2.2 Support
Intel® MPI Library

Backwards compatible with MPI 2.1 programs

Delivers Distributed Graph Topology Interface
• Scalable & Informative for MPI Library Communications
• Easy to Use Mechanism for Conveying Comms Patterns to MPI Applications
• Used by MPI Library to Improve Mapping Process to Process Communications
• Allows better fit for Applications Communications to Hardware Capabilities

Outstanding Support Of The Latest MPI Standard

Optimize MPI Communications
Intel® Trace Analyzer and Collector (Intel® ITAC, part of Cluster Studio XE 2013)

Visually understand parallel application behavior
• Communications Patterns
• Hotspots
• Load Balance

MPI Checking
• Detect Deadlocks
• Data Corruption
• Errors in Parameters, Data Types, etc.

Scaling
• Analysis Capability increasing to 6k processes

[Chart: Intel® Trace Analyzer and Collector analysis capacity, in processes, by year, 2010 to 2012.]

Expanding MPI Profiling Capacity for Communications Optimization

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
Backup

Value of Suites

Suite Only Features
• Advisor XE: Parallelism Advice
• C++ Performance Guide: Performance Wizard
• Pointer Checker: Reduces memory corruption
• Code Complexity Analysis: Find code likely to be less reliable
• Static Analysis (Improved!): Find Errors and Harden your Security

What’s New in Libraries?
Compilers & Libraries

Intel® MKL
• Digital random number generator (DRNG) for improved vector statistics calculations
• Automatically utilize Intel® Xeon Phi™ Coprocessors and balance compute loads between CPUs and coprocessors

Intel® IPP
• Enhanced image resize performance primitives
• Improved IPP footprint size

Intel® TBB
• Improved usability and reliability of the Flow Graph feature
• Additional C++11 Support

“Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table: crowd simulation software.”
Michaël Rouillé, CTO, Golaem

Ready to Use Libraries to Increase Performance