Three Questions Everyone Keeps Asking

Stephen Blair-Chappell
Intel Compiler Labs
Three Common Requests
“How can I make my program run faster?”
“How can I make my program parallel?”
“Will my code run on any CPU?” (compatibility)
8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Intel® Parallel Studio XE: Three Components
• Intel® Composer XE (compiler, libraries): use to generate fast, safe, parallel code (C/C++, Fortran)
• Intel® VTune™ Amplifier XE (profiler): find hotspots and bottlenecks in your code
• Intel® Inspector XE (memory errors, parallel errors): use to find memory and threading errors
Intel® Parallel Studio XE: Four Components
• Intel® Parallel Advisor: use to model parallelism in your existing applications
• Intel® Composer XE (compiler, libraries): use to generate fast, safe, parallel code (C/C++, Fortran)
• Intel® VTune™ Amplifier XE (profiler): find hotspots and bottlenecks in your code
• Intel® Inspector XE (memory errors, parallel errors): use to find memory and threading errors
The compiler uses many optimisation techniques
Faster Code
fast floating point
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf
Faster Code
Often we are happy with the out-of-the-box experience.
When was the last time you looked at some documentation?
The Seven Optimisation Steps
Faster Code
Example options are shown as Windows (Linux).
Step 1: Build with optimization disabled: /Od (-O0)
Step 2: Use general optimizations: /O1, /O2, /O3 (-O1, -O2, -O3)
Step 3: Use processor-specific options: /QxSSE4.2, /QxHOST (-xsse4.2, -xhost)
Step 4: Add inter-procedural optimization: /Qipo (-ipo)
Step 5: Use profile-guided optimization: /Qprof-gen, /Qprof-use (-prof-gen, -prof-use)
Step 6: Tune automatic vectorization: /Qguide (-guide)
Step 7: Implement parallelism or use automatic parallelism: use the Intel family of parallel models, or /Qparallel (-parallel)
Vectorisation is …
Faster Code
SIMD instruction set extensions by year:
1999 SSE: 70 instr; single-precision vectors; streaming operations
2000 SSE2: 144 instr; double-precision vectors; 8/16/32/64/128-bit vector integer
2004 SSE3: 13 instr; complex data
2006 SSSE3: 32 instr; decode
2007 SSE4.1: 47 instr; video; graphics building blocks; advanced vector instr
2008 SSE4.2: 8 instr; string/XML processing; POP-Count; CRC
2009 AES-NI: 7 instr; encryption and decryption; key generation
2011 AVX: ~100 new instr; ~300 legacy SSE instr updated; 256-bit vector; 3- and 4-operand instructions
2012/2013 AVX2: Int. AVX expands to 256 bit; improved bit manip.; fma; vector shifts; gather
2012 MIC: 512-bit vector
Example: the loop
for (i=0;i<MAX;i++)
c[i]=a[i]+b[i];
is executed four elements at a time:
a[3] a[2] a[1] a[0]
+
b[3] b[2] b[1] b[0]
=
c[3] c[2] c[1] c[0]
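The loop on this slide is the classic candidate for automatic vectorization. A minimal sketch of how it might be written so a compiler can vectorize it follows; the function name add_arrays and the use of C99 restrict are assumptions for illustration and do not come from the slides.

```c
#include <stddef.h>

/* Element-wise add: c[i] = a[i] + b[i].
 * `restrict` promises the compiler that the three arrays do not overlap,
 * which removes one common obstacle to automatic vectorization.
 * (add_arrays is a hypothetical name used only for this sketch.) */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler and the options in use, so it is worth checking the compiler's vectorization report rather than assuming it.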
Different Ways of Inserting Vectorised Code
Faster Code
Ordered from ease of use (top) to programmer control (bottom):
• Use performance libraries (e.g. IPP and MKL)
• Compiler: fully automatic vectorization
• Cilk Plus array notation
• Compiler: auto-vectorization hints (#pragma ivdep, …)
• User-mandated vectorization (SIMD directive)
• Manual CPU dispatch (__declspec(cpu_dispatch …))
• SIMD intrinsic class (F32vec4 add)
• Vector intrinsic (_mm_add_ps())
• Assembler code (addps)
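To make the "vector intrinsic" rung of this list concrete, here is a hedged sketch of an array add written with SSE intrinsics. The function name add_arrays_sse and the HAVE_SSE macro are assumptions made for this example; the intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) are the standard ones from xmmintrin.h. On targets without SSE the sketch falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>          /* SSE intrinsics: __m128, _mm_add_ps, ... */
#define HAVE_SSE 1              /* hypothetical helper macro for this sketch */
#else
#define HAVE_SSE 0
#endif

/* c[i] = a[i] + b[i], four floats per step where SSE is available. */
void add_arrays_sse(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++) {                   /* scalar remainder (or full fallback) */
        c[i] = a[i] + b[i];
    }
}
```

Compared with the plain loop, this version gives the programmer control over the instructions used, at the cost of portability and readability, which is the trade-off the list describes.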
An example
Faster Code
[Figure: speedup by upgrading silicon compared with speedup by swapping compiler, verified using VTune]
Speedup using parallelism
Parallel Code
Four Step Development
1. Analyze (Amplifier XE): hotspot
2. Implement (Composer XE): compiler; libraries (IPP, MKL, TBB); OpenMP; Cilk Plus
3. Debug (Inspector XE): memory, threads
4. Tune (Amplifier XE): concurrency, locks & waits, EBS (XE only)
Four Different Ways to Find the Hotspots
1. Using Intel compiler’s loop profiler & profile viewer
2. Using the compiler’s Auto-parallelizer
3. Using Amplifier XE
4. Performing a Survey with Advisor
Parallel Code (Analyze, Implement, Debug, Tune)
Language to help parallelism
• Intel® Cilk™ Plus
• OpenMP
• Intel® Threading Building Blocks
• Intel® MPI
• Fortran Coarrays
• OpenCL
• Native Threads

OpenMP:
#pragma omp parallel for
for (i = 1; i <= 4; i++) {
    printf("Iter: %d", i);
}

Cilk™ Plus:
cilk_for (int i = 0; i < max_row; i++) {
    for (int j = 0; j < max_col; j++) {
        p[i][j] = mandel(complex(scale(i), scale(j)));
    }
}
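The OpenMP loop on this slide only prints, so its iterations are independent. As soon as iterations update a shared variable, a reduction clause is needed. The following is a minimal sketch under that assumption; sum_of_squares is a hypothetical function, not part of the lab code.

```c
/* Sum 1*1 + 2*2 + ... + n*n with an OpenMP parallel loop.
 * reduction(+:total) gives each thread a private partial sum and
 * combines the partial sums when the loop ends. Built without OpenMP
 * support, the pragma is ignored and the loop runs serially. */
long sum_of_squares(int n)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++) {
        total += (long)i * i;
    }
    return total;
}
```

Without the reduction clause, concurrent updates to total would be a data race of the kind the Debug step is meant to find.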
Four Different Ways to Find your Parallel Errors
Parallel Code (Analyze, Implement, Debug, Tune)
1. Using Inspector XE
2. Perform a Static Security Analysis
3. Debug with Parallel Debug Extensions
4. Use Advisor
An example …
Parallel Code
1. Hotspot Analysis
2. Implement
3. Find Threading Errors
4, 5, 6. Tune Parallelism
https://makebettercode.com/parallel_landing_required/lib/pdf/5373_IN_ParallelMag_Sudoku_060911.pdf
Will my program run on any CPU?
Compatible Code
• Compatibility: run?
• Future Proofing: build?
• Scalability: performance?
OS-agnostic, CPU-agnostic, Language / Standards, Tools
[Figure: performance graphs for vectorised and parallel code. On the graphs, bigger is better.]
Running Example: Monte Carlo
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
    float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0f, val2 = 0.0f;

    /* Vectorise the inner loop; val and val2 are sum reductions */
    #pragma simd reduction(+:val) reduction(+:val2)
    for (int pos = 0; pos < RAND_N; pos++) {
        float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
        val += callValue;
        val2 += callValue * callValue;
    }

    float exprt = expf(-RISKFREE * T[opt]);
    h_CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrtf(((float)RAND_N * val2 - val * val) / ((float)RAND_N * (float)(RAND_N - 1.f)));
    h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
}
SFTL003 hands on lab
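The last four statements of the outer loop compute a discounted sample mean and a 95% confidence half-width. Written out as equations, with the shorthand N for RAND_N, x_i for the i-th callValue, r for RISKFREE and T for T[opt] (shorthand introduced here, not used on the slide), the code appears to evaluate:

```latex
\begin{align*}
\mathrm{val} &= \sum_{i=1}^{N} x_i, \qquad
\mathrm{val2} = \sum_{i=1}^{N} x_i^{2} \\
s &= \sqrt{\frac{N \sum_{i} x_i^{2} - \left(\sum_{i} x_i\right)^{2}}{N\,(N-1)}} \\
\mathrm{h\_CallResult} &= e^{-rT} \cdot \frac{1}{N} \sum_{i=1}^{N} x_i \\
\mathrm{h\_CallConfidence} &= e^{-rT} \cdot \frac{1.96\, s}{\sqrt{N}}
\end{align*}
```

Here s is the sample standard deviation and 1.96 is the two-sided 95% quantile of the standard normal distribution. Both sums are accumulated in the inner loop, which is why val and val2 appear in the reduction clauses.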
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications
More Cores. Wider Vectors. Performance Delivered.
More cores: multicore, many-core (50+ cores)
Wider vectors: 128 bits, 256 bits, 512 bits
Scaling performance efficiently: serial performance; task & data parallel performance; distributed performance
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
Phases covered: Build; Verify & Tune
• Intel® Advisor XE
  Feature: Threading design assistant (Studio products only)
  Benefit: Simplifies, demystifies, and speeds parallel application design
• Intel® Composer XE
  Feature: C/C++ and Fortran compilers, Intel® Threading Building Blocks, Intel® Cilk™ Plus, Intel® Integrated Performance Primitives, Intel® Math Kernel Library
  Benefit: Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core
• Intel® MPI Library†
  Feature: High Performance Message Passing (MPI) Library
  Benefit: Enabling High Performance Scalability, Interconnect Independence, Runtime Fabric Selection, and Application Tuning Capability
• Intel® VTune™ Amplifier XE
  Feature: Performance Profiler for optimizing application performance and scalability
  Benefit: Removes guesswork, saves time, and makes it easier to find performance and scalability bottlenecks
• Intel® Inspector XE
  Feature: Memory & threading dynamic analysis for code quality; Static Analysis for code quality
  Benefit: Increases productivity and code quality and lowers cost; finds memory, threading, and security defects before they happen
• Intel® Trace Analyzer & Collector†
  Feature: MPI Performance Profiler for understanding application correctness & behavior
  Benefit: Analyze performance of MPI programs and visualize parallel application behavior and communications patterns to identify hotspots
Efficiently Produce Fast, Scalable and Reliable Applications
Top New Features
Performance
• Improved compiler and library performance
• + Ivy Bridge microarchitecture
• + Haswell microarchitecture
• + Intel® Xeon Phi™ coprocessor
Performance Profiling
• A dozen new analysis features
• Low overhead Java* profiling
• CPU Power Analysis
Reliability
• Pointer checker
• Heap growth analysis
• Improved MPI fault tolerance†
Reproducibility
• Conditional numerical reproducibility
Parallelism Assistance
• Analysis extended to include Linux*, Fortran and C# (in addition to Windows* and C/C++)
Standards
• Expanded C++ 11
• Expanded Fortran 2008
• MPI 2.2†
†Intel® Cluster Studio XE
Efficiently produce fast, scalable and reliable applications running on Windows* and Linux*
The Build Environment
Target  Tool         Step                Macro
13-0    MS Compiler  1st version         STEP_0
13-1    ICC          Find Hotspot        STEP_1
13-2    ICC          Add SSE Intrinsics  STEP_2
13-3    VTune        Find Hotspot        STEP_3
13-4    ICC          Add OpenMP          STEP_4
13-5    Inspector    Check Correctness   STEP_5
13-6    ICC          Fix Correctness     STEP_6
13-7    VTune        Tune Parallelism    STEP_7
13-8    ICC          Finish              STEP_8
Code: Solver, Generator
Key (build modes): Serial Release Mode, OpenMP Debug Mode, OpenMP Release Mode
Build example: make 13-0 or nmake 13-0
How to Run
13-0.exe test.txt
Your Challenge – the hands-on
Examine each of the eight stages and use a combination of the compiler, Inspector, and Amplifier to understand ‘what’s going on’.
Answer these questions:
• Is the application using the CPU at its best? (Steps 0, 2 and 8)
• What’s the biggest hotspot in the serial code? (Steps 1 and 3)
• What errors were introduced into the parallelism? (Steps 4, 5 & 6)
• How well is the parallelism tuned? (Steps 7 & 8)
Supplement:
Why is the Linux version slower than the Windows version?
Thank You
Backup
Intel® Parallel Studio XE and Intel® Cluster Studio XE (30 minutes)
What’s New?
Intel® Parallel Studio XE 2013 / Intel® Cluster Studio XE 2013
Performance Leadership:
• 3rd Generation Intel® Core™ Processors (code name “Ivy Bridge”) and future Intel® processors (code name “Haswell”)
• Intel® Xeon Phi™ coprocessors
• Improved C++ and Fortran performance
New Product Capabilities
• Latest OS: Windows* 8 Desktop, Linux*
• IDE: Visual Studio 2008, 2010, 2012 and GNU tool chain
• Standards: C99, selected C++11 features, almost complete Fortran 2003 support, selected features from Fortran 2008, MPI 2.2
Boost Performance
Support for Latest Intel Processors and Coprocessors
Columns: Intel® Ivy Bridge microarchitecture | Intel® Haswell microarchitecture | Intel® Xeon Phi™ coprocessor
Intel® C++ and Fortran Compiler: ✔ AVX | ✔ AVX2, FMA3 | ✔ IMCI
Intel® TBB library: ✔ | ✔ | ✔
Intel® MKL library: ✔ AVX | ✔ AVX2, FMA3 | ✔
Intel® MPI library: ✔ | ✔ | ✔
Intel® VTune™ Amplifier XE†: ✔ Hardware Events | ✔ Hardware Events | ✔ Hardware Events
Intel® Inspector XE: ✔ Memory & Thread Checks | ✔ Memory & Thread | ✔ Memory & Thread††
† Hardware events for new processors added as new processors ship.
†† Analysis runs on multicore processors, provides analysis for multicore and many-core processors.
Performance-Oriented Compiler Suites
Intel® Compilers, Performance Libraries, Debugging Tools
On Windows, Linux and Mac OS X
Intel® C++ Composer XE 2013
• Intel® C++ Compiler XE 13.0 with Intel® Cilk™ Plus
• Intel® TBB
• Intel® MKL
• Intel® IPP
• Intel® Xeon Phi™ product family support, Linux
Intel® Fortran Composer XE 2013
• Intel® Fortran Compiler XE 13.0
• Intel® MKL
• Compatibility with Compaq Visual Fortran*
• Fortran 2003, 2008 support
• Intel® Xeon Phi™ product family support, Linux
Intel Composer XE 2013
• Combines Intel C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows (requires Visual Studio) and Linux only
Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio*
Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT
Mac OS X: Intel C++/gcc compatibility & integration into XCode* Environment
All: Intel Fortran performance leadership, compatible with Compaq* Visual* Fortran
All: Leadership performance on Intel and compatible architectures
All: One Year Intel® Premier Support. Renewable Annually.
Performance. Compatibility. Support.
Superior C++ Compiler Performance
More Performance
• Just recompile
• Uses Intel® AVX and Intel® AVX2 instructions
• Intel® Xeon Phi™ product family support on Linux: compiler, debugger
• Intel® Cilk™ Plus: Tasking and vectorization
Superior Fortran Compiler Performance
More Performance
• Just recompile
• Intel® Xeon Phi™ product family: Linux compiler, debugger support
• Access to Intel® AVX and Intel® AVX2 instructions (-xa or /Qxa)
• Auto-parallelizer & directives to access SIMD instructions
• Coarrays & synchronization constructs support parallel programming
• Loop optimization directives: VECTOR, PARALLEL, SIMD
• More control over array data alignment (align arrayNbytes)
C++ Performance Guide
Performance Wizard for Windows
• Quick 5 step process for more performance
• Get help choosing optimization options
Gain Performance with Less Effort
Intel® Math Kernel Library (MKL)
• Highly optimized threaded math routines
• Applications in science, engineering, finance
• Use Intel® MKL on Windows*, Linux*, Mac OS*
• Use Intel® MKL with Intel compiler, gcc, MSFT*, PGI
• Component of Intel® Parallel Studio XE and Intel® Cluster Studio XE
33% of math libraries users rely on Intel’s Math Kernel Library (EDC North America Development Survey 2011, Volume II)
Drop In The Next Intel® MKL Version to Unlock New Processor Performance
LAPACK Performance Improves with
Intel® Math Kernel Library
Intel® Integrated Performance Primitives (IPP)
A Library Of Highly Optimized Algorithmic Building Blocks For Media And Data Applications
Optimized for Performance and Power Efficiency
• Highly optimized using SSE, AVX instruction sets
• Performance beyond what an optimized compiler produces alone
Intel Engineered & Future Proofed to Save You Time
• Ready-to-use & royalty free
• Fully optimized for current and past processors
• Save development, debug, and maintenance time
• Code once now, receive future optimizations later
Wide range of Cross Platform & OS Functionality
• Thousands of optimized functions
• Supports Windows*, Linux*, and Mac OS* X
• Supports Intel® Atom, Intel® Core, and Intel® Xeon platforms
• Availability: Part of several different product packages with single, multi-user licenses as well as volume, academic, and student discounts available.
• Try it Before You Buy It: Download a trial version today at intel.com/software/products/eval
Performance Building Blocks to Make Your Applications Faster, Faster
Intel® IPP
Boost from Intel® AVX
Intel® VTune™ Amplifier XE
Performance Profiler
Where is my application…
Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source
Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses
Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait
• Windows & Linux
• Low overhead
• No special recompiles
“We improved the performance of the latest run 3 fold. We wouldn't have found the problem without something like Intel® VTune™ Amplifier XE.”
Claire Cates, Principal Developer, SAS Institute Inc.
Advanced Profiling for Scalable Multicore Performance
A Dozen New Analysis Features
Intel® VTune™ Amplifier XE 2013
More Profiling Data
1) Statistical Call Counts: data for inlining & parallelization
2) Hardware Events + Stacks: lower overhead, higher resolution; finds hot spots in small functions
3) Uncore Event Counting: more accurate bandwidth analysis
4) Ivy Bridge Events
5) Haswell Events
(Updates as new processors ship)
Easier To Use
7) Source View for Inlined Code (for Intel® and GCC* compilers)
8) Java Tuning: results map to the Java source
9) Task Annotation API: label and visualize tasks
10) User Defined Metrics: create meaningful metrics from events
11) Programmable Hot Keys: start and stop collection easily
Also listed (items 6 and 12):
• Intel® Xeon Phi™ Products: hardware events
• More/Better Advanced Profiles (e.g., Bandwidth)
Easy to Use, Wealth of Data, Powerful Analysis
Low Overhead Java* Profiling
Intel® VTune™ Amplifier XE 2013
Low Overhead & Precise
• Sampling is fast / unobtrusive
• Hardware sampling even faster (Now with optional stacks!)
• Advanced profiles are unique (cache misses, bandwidth…)
Versatile & Easy to Use
• Multiple simultaneous JVMs
• Mixed Java / C++ / Fortran
• See results on the Java source
Better Data, Lower Overhead, Easier to Use
CPU Power Analysis
Intel® VTune™ Amplifier XE 2013
To decrease CPU power usage, minimize wake-ups
• Identify wake-up causes
  – Timers triggered by application
  – Interrupts mapped to HW intr level
  – Show wake-up rate
• Display source code for events that wake up the processor
• Show CPU frequencies by CPU core (CPU frequencies can change by CPU activity level)
• Linux only
Select & filter to see a single wake-up object.
Uniquely Identifies the Cause of Wake-ups and Gives Timer Call Stacks
Scale Forward
Simplify and Speed Threading Design
Intel® Advisor XE – Threading Assistant
The Challenge of Parallel Design:
• Need to implement to measure performance
• Implementation is time consuming
• Disrupts regular product development
• Testing difficult without tools
Intel Advisor XE Separates Design & Implementation
• Fast exploration of multiple options
• Find errors before implementation
• Design without disrupting development
New! Linux* and Windows*
New! C, C++, Fortran and C# code
Add Parallelism with Less Effort, Less Risk and More Impact
Design Then Implement
Intel® Advisor XE 2013 – Threading Assistant
Design Parallelism
• No disruption to regular development
• All test cases continue to work
• Tune and debug the design before you implement it
1) Analyze it.
2) Design it. (Compiler ignores these annotations.)
3) Tune it.
4) Check it.
Implement Parallelism
5) Do it!
Less Effort, Less Risk, More Impact
Scale Forward with Intel Parallel Models
Extend to Intel® Xeon Phi™ Coprocessors
Abstract, Scalable and Composable
• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism
• Intel® Threading Building Blocks: widely used C++ template library for thread management
Support Standards
• OpenMP
• Coarray Fortran
• MPI
Intel® Xeon Processors, and Compatible Processors
Intel® Xeon Phi™ product family
Open programming models and also Intel products
Don’t Leave Your Code Behind
Simplify Parallelism
Intel® Cilk™ Plus, Intel® Threading Building Blocks
Intel® Cilk™ Plus
What: Language extensions to simplify task/data parallelism
Features:
• 3 simple keywords & array notations for parallelism
• Support for task and data parallelism
• Semantics similar to serial code
Why:
• Simple way to parallelize your code
• Sequentially consistent, low overhead, powerful solution
• Supports C, C++, Windows and Linux
Intel® Threading Building Blocks
What: Widely used C++ template library for task parallelism
Features:
• Parallel algorithms and data structures
• Scalable memory allocation and task scheduling
• Synchronization primitives
Why:
• Rich feature set for general purpose parallelism
• Available as open source or commercial license
• Supports C++, Windows, Linux, Mac OS X, other OSs
Task and Data Parallelism Made Easier
Parallelize Applications For Performance
Intel® Threading Building Blocks (TBB)
A popular, proven parallel C++ abstraction
A C++ template library
• Scalable memory allocation
• Load-balancing
• Work-stealing task scheduling
• Thread-safe pipeline
• Flexible flow graph
• Concurrent containers
• High-level parallel algorithms
• Numerous synchronization primitives
• Open source, and portable across many OSs
“Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table.”
Michaël Rouillé, CTO, Golaem
Simplify Parallelism with a Scalable Parallel Model
Scale Forward and Extend to
Intel® Xeon Phi™ Coprocessors
Intel® Cilk™ Plus
Intel® Cilk™ Plus (Language Extension to C/C++)
Easier Task & Data Parallelism
3 simple Keywords:
cilk_for, cilk_spawn, cilk_sync
Intel® Cilk™ Plus Array Notation
Save time with powerful vectorization
Minimize Software Re-Work for New Hardware
Increase Reliability
Pointer Checker
Finds buffer overflows and dangling pointers before
memory corruption occurs
Powerful error reporting
Integrates into standard debuggers (Microsoft, gdb, Intel)
Dangling pointer
{
    char *p, *q;
    p = malloc(10);
    q = p;
    free(p);
    *q = 0;
}
Buffer Overflow
{
    char *my_chp = "abc";
    char *an_chp = (char *) malloc (strlen((char *)my_chp));
    memset (an_chp, '@', sizeof(my_chp));
}
CHKP: Bounds check error
Traceback:
./a.out(main+0x1b2) [0x402d7a] in file mems.c at line 13
Pointer Checker Highlights Programming Errors For More Secure Applications
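The buffer-overflow example on this slide goes wrong because sizeof(my_chp) is the size of a pointer, not the length of the string, so memset writes past the 3-byte allocation. One possible corrected form is sketched below; make_filled_copy is a hypothetical helper invented for this illustration and is not part of the slide's code.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap buffer as long as `src`, filled with `fill` and
 * NUL-terminated, or NULL if allocation fails. The caller frees it.
 * (Hypothetical helper, written only to illustrate the fix.) */
char *make_filled_copy(const char *src, char fill)
{
    size_t len = strlen(src);
    char *buf = malloc(len + 1);    /* +1 for the terminating NUL */
    if (buf == NULL) {
        return NULL;
    }
    memset(buf, fill, len);         /* string length, not sizeof(pointer) */
    buf[len] = '\0';
    return buf;
}
```

The dangling-pointer case has a simpler remedy: do not dereference q after free(p), for example by setting both pointers to NULL once the block is released.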
Conditional Numerical Reproducibility
Intel® Math Kernel Library:
• New deterministic task scheduling and code path selection options
OpenMP*:
• New deterministic reduction option
Intel® Threading Building Blocks
• New parallel deterministic reduce option
“I’m a C++ and Fortran developer and have high praise for the Intel® Math Kernel Library. One nice feature I’d like to stress is the numerical reproducibility of MKL which helps me get the assurance I need that I’m getting the same floating point results from run to run.”
Franz Bernasek, Owner / CEO, Senior Developer, MSTC Modern Software Technology
Help Achieve Reproducible Results, Despite Non-associative Floating Point Math
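"Non-associative" here means that the order of additions can change a floating point result, which is why a parallel reduction may differ from run to run when the combination order changes. A small sketch shows the effect; the function names sum_forward and sum_reverse are assumptions made for this example.

```c
/* Sum the same values in two different orders.
 * The accumulator is volatile so that every partial sum is rounded to
 * single precision, even on targets that keep wider intermediates. */
float sum_forward(const float *x, int n)
{
    volatile float s = 0.0f;
    for (int i = 0; i < n; i++) {
        s = s + x[i];
    }
    return s;
}

float sum_reverse(const float *x, int n)
{
    volatile float s = 0.0f;
    for (int i = n - 1; i >= 0; i--) {
        s = s + x[i];
    }
    return s;
}
```

With the IEEE 754 single-precision values {1e8f, -1e8f, 1.0f}, the forward sum cancels the two large terms first and keeps the 1, while the reverse sum absorbs the 1 into -1e8 before cancelling. A deterministic reduction option fixes the combination order so that the same answer comes back on every run.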
Expanded C++ 11 support
• Additional type traits
• Initializer lists (partial)
• Generalized constant expressions (partial)
• Noexcept (partial)
• Range based for loops
• Conversions of lambdas to function pointers
Excellent Support for C++ 11 on Windows* and Linux*
Expanded Fortran 2008 Support
• Maximum array rank has been raised to 31 dimensions (Fortran 2008 specifies 15)
• Recursive type may have ALLOCATABLE components
• Coarrays
  – CODIMENSION attribute
  – SYNC ALL statement
  – SYNC IMAGES statement
  – SYNC MEMORY statement
  – CRITICAL and END CRITICAL statements
  – LOCK and UNLOCK statements
  – ERROR STOP statement
  – ALLOCATE and DEALLOCATE may specify coarrays
  – Intrinsic procedures IMAGE_INDEX, LCOBOUND, NUM_IMAGES, THIS_IMAGE, UCOBOUND
• CONTIGUOUS attribute
• MOLD keyword in ALLOCATE
• DO CONCURRENT
• NEWUNIT keyword in OPEN
• G0 and G0.d format edit descriptor
• Unlimited format item repeat count specifier
• CONTAINS section may be empty
• Intrinsic procedures BESSEL_J0, BESSEL_J1, BESSEL_JN, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZ
• Additions to intrinsic module ISO_FORTRAN_ENV: ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16, INT32, INT64, LOCK_TYPE, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED
New: ATOMIC_DEFINE and ATOMIC_REF, initialization of polymorphic INTENT(OUT) dummy arguments, standard handling of G format and of printing the value zero, coarrays (more support), polymorphic source allocation
Leadership F2008 Support on Linux*, Windows* & OSX*
Dynamic Analysis Finds Memory & Threading Errors
Intel® Inspector XE 2013
Find and eliminate errors
• Memory leaks, invalid access…
• Races & deadlocks
• Analyze hybrid MPI cluster apps
• Heap growth analysis
Faster & Easier to use
• Debugger breakpoints
• Break on selected errors
• Run faster to known error
• Pause/resume collection
• Narrow analysis focus
• Better performance
• Improved error suppression
Find Errors Early When They are Less Expensive
Intel®
Inspector
XE
Heap Growth Analysis
Intel® Inspector XE 2013
Does Application Memory
Usage Mysteriously Grow?
• Set an analysis interval with
start and analysis end points
– Click a button –or–
– Use an API
• See a list of memory
allocations that are not freed
in the interval
• Quickly zero in on suspicious
activity that contributes to
heap growth
Speeds Diagnosis of Difficult to Find Heap Errors
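The idea of an analysis interval can be sketched with a toy allocation ledger in C++. This is only an illustration of the concept described above, assuming allocations are identified by an integer id and a call-site label; the class `HeapLedger` and its methods are hypothetical and are not the Inspector XE API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy ledger (hypothetical, NOT Inspector XE's API): remembers which
// allocations are still live and in which interval each one was made.
class HeapLedger {
public:
    // Record an allocation with an id, a size and a call-site label.
    void on_alloc(int id, std::size_t bytes, const std::string& site) {
        live_[id] = Record{bytes, site, epoch_};
    }

    // Forget an allocation once it has been freed.
    void on_free(int id) { live_.erase(id); }

    // Mark the start point of a new analysis interval.
    void start_interval() { ++epoch_; }

    // At the end point, list the call sites of allocations that were made
    // during the current interval and have not been freed.
    std::vector<std::string> end_interval() const {
        std::vector<std::string> sites;
        for (const auto& entry : live_) {
            if (entry.second.epoch == epoch_) {
                sites.push_back(entry.second.site);
            }
        }
        return sites;
    }

private:
    struct Record {
        std::size_t bytes;
        std::string site;
        int epoch;
    };
    std::map<int, Record> live_;
    int epoch_ = 0;
};
```

An allocation made before `start_interval()` is ignored, and one that is freed before `end_interval()` drops out of the list, so only the allocations that contribute to growth within the interval remain.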
Static Analysis Finds Coding and Security Errors
Intel® Parallel Studio XE 2013
Find over 250 error types, e.g.:
• Incorrect directives
• Security errors
Easier to use
• Choose your priority:
- Minimize false errors
- Maximize error detection
• Hierarchical navigation of results
• Share comments with the team
Increased Accuracy & Speed
• Detect errors without all source files
• Better scaling with large code bases
Code Complexity Metrics
• Find code likely to be less reliable
Find Errors and Harden your Security
Static Analysis is only available in Studio XE bundles. It is not sold separately.
Cluster Tools
Scale Forward, Scale Faster
Intel Cluster Tools
Scale Performance – Perform on More Nodes
• MPI Latency - Intel® MPI Library - Up to 6.5X as fast as alternative MPI libraries
• Compiler Performance – Industry leading Intel® C/C++ & Fortran compilers
Scale Forward – multicore now, many-core ready
• Intel® MPI Library scales beyond 120k processes
• Focused to preserve programming investments for multicore and many-core machines
Scale Efficiently – Tune & Debug on More Nodes
• Thread & Memory Correctness Checking – Intel® Inspector XE now MPI enabled across many nodes
• Rapid Node Level Performance Profiling – Intel® VTune™ Amplifier XE can identify hotspots faster and on thousands of nodes
High-Performance, Standards-Driven, Fabric-Flexible MPI Library
On the Path to Exascale
Intel® MPI Library (part of Cluster Studio XE 2013)
Increased Scaling
• 120k Processes
Standards Support
• MPI 2.2
Latest hardware support
• Ivy Bridge and Haswell
• Intel® Xeon Phi™ Coprocessor
[Chart: Intel® MPI Library scaling in K processes by year, 2010 to 2018; y-axis "Processes", 0 to 18000. Labeled points: 60K, 90K, 120K. Series: Intel® MPI Library; Doubling; Exascale (estimated).]
Continued Scaling Capacity to Meet Ever Growing HPC Demands
Improved MPI Fault Tolerance
Checkpointing
Implementation of Berkeley Lab Checkpoint/Restart (BLCR) †
Primary Uses
• Scheduling
• Process Migration
• Failure Recovery
[Diagram: Fault Recovery Scenario, showing a node fault followed by checkpoint recovery]
Enabling Capabilities for Robust at Scale MPI Computing
†
MPI 2.2 Support
Backwards compatible with MPI 2.1 programs
Delivers Distributed Graph Topology Interface
• Scalable & Informative for MPI Library Communications
• Easy to Use Mechanism for Conveying Comms Patterns
to MPI Applications
• Used by MPI Library to Improve Mapping of Process-to-Process Communications
• Allows Better Fit of Application Communications to Hardware Capabilities
Outstanding Support Of The Latest MPI Standard
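In the MPI 2.2 distributed graph interface, each process describes only its own incoming and outgoing edges, which is what makes the description scalable. The C++ sketch below builds those per-rank neighbour lists for a simple ring pattern. The ring pattern, the `Neighbours` struct and the `ring_neighbours` helper are assumptions made for illustration; in a real program the two lists would be handed to `MPI_Dist_graph_create_adjacent` as its `sources` and `destinations` arguments, and that call is left out so the sketch builds without an MPI installation.

```cpp
#include <cassert>
#include <vector>

// Per-rank neighbour lists, in the shape the sources/destinations arguments
// of MPI_Dist_graph_create_adjacent take (the MPI call itself is omitted).
struct Neighbours {
    std::vector<int> sources;       // ranks this process receives from
    std::vector<int> destinations;  // ranks this process sends to
};

// Hypothetical helper: in a ring of `size` processes, each rank receives
// from its left neighbour and sends to its right neighbour, wrapping around.
Neighbours ring_neighbours(int rank, int size) {
    Neighbours n;
    n.sources.push_back((rank - 1 + size) % size);
    n.destinations.push_back((rank + 1) % size);
    return n;
}
```

Each rank computes only its own two edges, so no process needs the full communication graph. With that information the library is free to reorder ranks so that neighbours are placed close together on the hardware.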
Optimize MPI Communications
Intel® Trace Analyzer and Collector (part of Cluster Studio XE 2013)
Visually understand parallel application behavior
• Communications Patterns
• Hotspots
• Load Balance
MPI Checking
• Detect Deadlocks
• Data Corruption
• Errors in Parameters, Data Types, etc.
Scaling
• Analysis Capability increasing to 6k processes
[Chart: Intel® Trace Analyzer and Collector capacity (processes) by year, 2010 to 2012; y-axis "Processes", 0 to 7000.]
Expanding MPI Profiling Capacity for Communications Optimization
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on
Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance
tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk
are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Backup
Value of Suites
Suite Only Features
• Advisor XE: Parallelism Advice
• C++ Performance Guide: Performance Wizard
• Pointer Checker: Reduces memory corruption
• Code Complexity Analysis: Find code likely to be less reliable
• Static Analysis (Improved!): Find Errors and Harden your Security
What’s New in Libraries?
Intel® MKL
• Digital random number generator (DRNG) for improved vector statistics
calculations
• Automatically utilize Intel® Xeon Phi™ Coprocessors and balance
compute loads between CPUs and coprocessors
Intel® IPP
• Enhanced image resize performance primitives
• Improved IPP footprint size
Intel® TBB
• Improved usability and reliability of the Flow Graph feature
• Additional C++11 Support
“Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table – crowd simulation software.”
Michaël Rouillé, CTO, Golaem
Ready to Use Libraries to Increase Performance