HPC Tutorial
Manoj Nambiar,
Performance Engineering Innovation Labs
Parallelization and Optimization CoE
www.cmgindia.org
Computer Measurement Group, India
0
A Common Expectation
Our ERP application has
slowed down. All the
departments are
complaining.
Let’s use HPC
Computer Measurement Group, India
1
Agenda
• Part – I
– A sample domain problem
– Hardware & Software
• Part – II Performance Optimization Case Studies
– Online Risk Management
– Lattice Boltzmann implementation
– OpenFOAM - CFD application (if time permits)
Computer Measurement Group, India
2
Designing an Airplane for performance ……
Problem: Calculate Total Lift and Drag on the plane for a wind-speed of 150 m/s
Computer Measurement Group, India
3
Performance Assurance – Airplanes vs Software
Assurance Approach | Airplane            | Software
Testing            | Wind tunnel testing | Load testing with virtual users
Simulation         | CFD simulation      | Discrete event simulation
Analytical         |                     | MVA, BCMP, M/M/k etc.
None               |                     |
(The slide also ranks these approaches on cost and accuracy.)
Computer Measurement Group, India
4
CFD Example – Problem Decomposition
Methodology
1. Partition the volume into cells
2. For a number of time steps
   2.a For each cell
       2.a.1 calculate velocities
       2.a.2 calculate pressure
       2.a.3 calculate turbulence
All cells have to be in equilibrium with each other; this becomes a large Ax = b problem.
The problem is partitioned into groups of cells which are assigned to CPUs.
Each CPU can compute in parallel, but they also have to communicate with each other.
Computer Measurement Group, India
5
A serial algorithm for Ax = b
Compute complexity – O(n²)
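For illustration only, a minimal serial Gauss–Seidel sweep for Ax = b (one of the solvers used for such systems) might look like the sketch below; each sweep over n unknowns costs O(n²) multiply–adds, and convergence checking is omitted.

/* Illustrative sketch only: solving Ax = b serially with Gauss-Seidel sweeps.
   Each sweep is O(n^2); convergence checks and preconditioning are omitted,
   and A[i][i] is assumed non-zero. */
void gauss_seidel(int n, double A[n][n], double b[n], double x[n], int sweeps)
{
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < n; i++) {
            double sigma = 0.0;
            for (int j = 0; j < n; j++)
                if (j != i)
                    sigma += A[i][j] * x[j];   /* uses already-updated x[j] for j < i */
            x[i] = (b[i] - sigma) / A[i][i];
        }
    }
}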
Computer Measurement Group, India
6
What kind of H/W and S/W do we need?
• Take an example Ax = b solver
  – Order of computational complexity is n²
  – Where n is the number of cells into which the domain is divided
• The higher the number of cells, the higher the accuracy
• Typical number of cells: in the tens of millions
• Prohibitively slow to run sequentially
• The increase in memory requirements will also need a proportionally higher number of servers
A parallel implementation is needed on a large cluster of servers
Computer Measurement Group, India
7
Software
• Let's look at the software aspect first
  – Then we will look at the hardware
Computer Measurement Group, India
8
Workload Balancing
• After solving Ax = b
  – Some elements of x need to be exchanged with neighbor groups
  – Every group (process) has to send values to and receive values from its neighbors
    • For the next Gauss–Seidel iteration
• Also need to check that all values of x have converged
Should this be done over TCP/IP, or with a 3-tier web/app/database architecture?
Computer Measurement Group, India
9
Why TCP/IP won't suffice
• Philosophically – NO
  – These parallel programs are peers
  – No one process is a client or a server
• Technically – NO
  – There can be as many as 10,000 parallel processes
    • Need to keep a directory of public server IP and port for each process
  – TCP is a stream-oriented protocol
    • Applications need to pass messages
• Changing the size of the cluster is tedious
Computer Measurement Group, India
10
Why a 3-tier application will not suffice
• 3-tier applications are meant to serve end-user transactions
  – This application is not transactional
• A database is not needed for these applications
  – No need to first persist and then read data
    • This kind of I/O will impact performance significantly
    • Better to store data in RAM
  – ACID properties of the database are not required
    • The application is not transactional in nature
  – SQL is a major overhead considering the data-velocity requirements
• Managed frameworks like J2EE and .NET are not optimal for such requirements
Computer Measurement Group, India
11
MPI to the rescue
• A message-oriented interface
• Has an API spanning about 300 functions
  – Supports complex messaging requirements
• A very simple interface for parallel programming
• Also portable, regardless of the size of the deployment cluster
Computer Measurement Group, India
12
MPI Functions
• MPI_Send
• MPI_Recv
• MPI_Wait
• MPI_Reduce
  – SUM
  – MIN
  – MAX
  – …
Computer Measurement Group, India
13
Not so intuitive MPI calls
• MPI_Allgather(v)
• MPI_Scatter(v)
• MPI_Gather(v)
• MPI_Alltoall(v)
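To make the collectives concrete, here is a small hedged sketch (not from the deck): rank 0 scatters equal chunks of a work array to every rank, each rank processes its chunk, and rank 0 gathers the results back.

/* Hedged sketch of MPI_Scatter / MPI_Gather with equal-sized chunks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, chunk = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *all  = NULL;
    double *mine = malloc(chunk * sizeof(double));
    if (rank == 0)
        all = calloc((size_t)chunk * nprocs, sizeof(double));

    MPI_Scatter(all, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++)            /* local work on this rank's chunk */
        mine[i] += 1.0;
    MPI_Gather(mine, chunk, MPI_DOUBLE, all, chunk, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}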
Computer Measurement Group, India
14
Sample MPI program – parallel addition of a large array
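A minimal sketch of such a program (illustrative; synthetic per-rank data stands in for a real array read from disk): every rank sums its own slice and MPI_Reduce combines the partial sums on rank 0.

/* Hedged sketch: parallel addition of a large range, combined with MPI_Reduce. */
#include <mpi.h>
#include <stdio.h>

#define N 100000000L

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long lo = rank * (N / nprocs);
    long hi = (rank == nprocs - 1) ? N : lo + N / nprocs;
    double local = 0.0, total = 0.0;
    for (long i = lo; i < hi; i++)
        local += (double)i;                    /* stand-in for real array data */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %.0f\n", total);
    MPI_Finalize();
    return 0;
}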
Computer Measurement Group, India
18
MPI – Send, Recv and Wait
If you have some computation that can be done while waiting to receive a message from a peer, this is the place to do it.
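One common way to do this is with non-blocking calls: post MPI_Irecv/MPI_Isend, do the local work that does not depend on the incoming message, and only then call MPI_Wait. A hedged sketch (the peer, halo and interior names are illustrative, not from the deck):

/* Hedged sketch of hiding communication behind useful local work. */
#include <mpi.h>

void exchange_and_compute(double *halo_in, double *halo_out, int n, int peer,
                          double *interior, int m)
{
    MPI_Request rreq, sreq;
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(halo_out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &sreq);

    for (int i = 0; i < m; i++)       /* useful work while the message is in flight */
        interior[i] *= 0.5;

    MPI_Wait(&rreq, MPI_STATUS_IGNORE);   /* halo_in is now safe to use  */
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);   /* halo_out may now be reused  */
}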
Computer Measurement Group, India
19
Hardware
• Let's look at the hardware
  – Clusters
  – Servers
  – Coprocessors
  – Parallel file systems
Computer Measurement Group, India
20
HPC Cluster
Not very different from regular data center clusters
Computer Measurement Group, India
21
Now let's look inside a server
NUMA
Coprocessors go here
Computer Measurement Group, India
22
Parallelism in Hardware
• Multi-server / multi-node
• Multi-socket (multi-socket server board)
• Multi-core (multi-core CPU)
• Coprocessors
  – Many-core
  – GPU
• Vector processing
Computer Measurement Group, India
23
Coprocessor - GPU
(PCIe card)
• SM – Streaming Multiprocessor
• Device RAM – high-speed GDDR5 RAM
• Extreme multi-threading – thousands of threads
Computer Measurement Group, India
24
Inside a GPU streaming multiprocessor (SM)
• An SM can be compared to a CPU core
• A GPU core is essentially an ALU
• All cores execute the same instruction at a time
  – What happens to “if-then-else”?
• A warp is the software equivalent of a CPU thread
  – Scheduled independently
  – A warp instruction is executed by all cores at a time
• Many warps can be scheduled on an SM
  – Just like many threads on a CPU
  – While one warp is scheduled to run, other warps can be moving data
• A collection of warps concurrently running on an SM makes a block
  – Conversely, an SM can run only one block at a time
Efficiency is achieved when there is one warp in each stage of the execution pipeline
Computer Measurement Group, India
25
How S/W runs on the GPU
1. A CPU process/thread initiates data transfer from CPU memory to GPU memory
2. The CPU invokes a function (kernel) that runs on the GPU
   – The CPU specifies the number of blocks and the number of threads per block
   – Each block is scheduled on one SM
   – After all blocks complete execution the CPU is woken up
3. The CPU fetches the kernel output from the GPU memory
This is known as the offload mode of execution
Computer Measurement Group, India
26
Co-Processor – Many Integrated Core (MIC)
• Cores are the same as Intel Pentium CPUs
  – With vector processing instructions
• The L2 cache is accessible by all the cores
Execution modes:
• Native
• Offload
• Symmetric
Computer Measurement Group, India
27
What is vector processing?
for(i=0; i<8; i++) c[i] = a[i]+b[i];

ALU in an ordinary CPU core:
    ADD C, A, B      – 1 arithmetic operation per instruction cycle

ALU in a CPU core with vector processing (vector registers):
    A: A1 A2 A3 A4 A5 A6 A7 A8
    B: B1 B2 B3 B4 B5 B6 B7 B8
    VADD C, A, B     – 8 arithmetic operations per instruction cycle
    C: C1 C2 C3 C4 C5 C6 C7 C8
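On Intel CPUs this is exposed, for example, through AVX intrinsics (or simply by letting the compiler auto-vectorize the loop). A small illustrative sketch of the same 8-wide addition in single precision:

/* Illustrative sketch: 8-wide addition with AVX intrinsics. In practice the
   compiler's auto-vectorizer often generates this for you. */
#include <immintrin.h>

void vec_add8(const float *a, const float *b, float *c)
{
    __m256 va = _mm256_loadu_ps(a);     /* load A1..A8 */
    __m256 vb = _mm256_loadu_ps(b);     /* load B1..B8 */
    __m256 vc = _mm256_add_ps(va, vb);  /* 8 additions in one instruction */
    _mm256_storeu_ps(c, vc);            /* store C1..C8 */
}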
Computer Measurement Group, India
28
HPC Networks – Bandwidth and Latency
Computer Measurement Group, India
29
Hierarchical network
(Top-of-rack switches feed an end-of-row switch)
• The most intuitive design of a network
  – Not uncommon in data centers
• What happens when the first 8 nodes need to communicate with the next 8?
  – Remember that all links have the same bandwidth
Computer Measurement Group, India
30
Clos Network
• Can be likened to a replicated hierarchical network
  – All nodes can talk to all other nodes
  – Dynamic routing capability is essential in the switches
Computer Measurement Group, India
31
Common HPC Network Technology – InfiniBand
• Technology used for building high-throughput, low-latency networks
  – Competes with Ethernet
• To use InfiniBand you need
  – A separate NIC on the server
  – An InfiniBand switch
  – InfiniBand cables
• Messaging supported in InfiniBand
  – A direct memory access (RDMA) read from, or write to, a remote node
  – A channel send or receive
  – A transaction-based operation (that can be reversed)
  – A multicast transmission
  – An atomic operation
Computer Measurement Group, India
32
Parallel File Systems - Lustre
• Parallel file systems give the same file-system interface to legacy applications
• Can be built out of commodity hardware and storage
Computer Measurement Group, India
33
HPC Applications - Modeling and Simulation
• Aerodynamics
  – Vehicular design
• Energy and Resources
  – Seismic analysis
  – Geo-physics
  – Mining
• Molecular Dynamics
  – Drug discovery
  – Structural biology
• Weather Forecasting
(Design flow shown on the slide: Prototype → Simulation → Physical Experimentation → Lab Verification → Final Design. HPC or no HPC? Consider accuracy, speed, power and cost.)
From Natural Science to Software
Computer Measurement Group, India
34
Relatively Newer & Upcoming Applications
• Finance
  – Risk computations
  – Options pricing
  – Fraud detection
  – Low-latency trading
• Image Processing
  – Medical imaging
  – Image analysis
  – Enhancement and restoration
• Video Analytics
  – Face detection
  – Surveillance
• Internet of Things
  – Smart city
  – Smart water
  – eHealth
• Bio-Informatics
  – Genomics
Knowledge of core algorithms is key
Computer Measurement Group, India
35
Technology Trends Impacting Performance & Availability
• Multi-core; clock speeds not increasing
• Memory evolution
  – Lower memory per core
  – Relatively low memory bandwidth
  – Deep cache & memory hierarchies
• Heterogeneous computing
  – Coprocessors
• Vector processing
• Temperature-fluctuation-induced slowdowns
• Memory-error-induced slowdowns
• Network communication errors
• Large clusters
  – Increased failure probability
Algorithms need to be re-engineered to make the best use of these trends
Computer Measurement Group, India
36
Knowing Performance Bounds
• Amdahl’s Law
  – Maximum speedup achievable: S(p) = 1 / (s + (1−s)/p)
  – Where s is the fraction of the code that has to run sequentially and p is the number of processors
It is also important to take problem size into account when estimating speedups.
The compute-to-communication ratio is key: typically, the larger the problem size, the higher the ratio and the better the speedup.
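For example, if s = 5% of the code is sequential and p = 64, the bound is 1 / (0.05 + 0.95/64) ≈ 15.4; even with unlimited processors the speedup can never exceed 1/s = 20.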
Computer Measurement Group, India
37
Quick Hardware Recap
FLOPS Bound
Bandwidth Bound
What about server clusters?
Computer Measurement Group, India
38
FLOPS and Bandwidth dependencies
• FLOPS – floating-point operations per second – depends on
  – Frequency
  – Number of CPU sockets
  – Number of cores per socket
  – Number of hyper-threads per core
  – Number of vector units per core / hyper-thread
• Bandwidths (bytes/sec) depend on
  – The level in the hierarchy – registers, L1, L2, L3, DRAM
  – Serial / parallel access
  – Whether the memory is attached to the same CPU socket or to another CPU
Why are we not talking about memory latencies?
Computer Measurement Group, India
39
Know your performance bounds
(Peak FLOPS and bandwidth figures for the CPU and GPU were shown on the slide.)
• The above information can also be obtained from product data sheets
• What do you gain by knowing performance bounds?
Computer Measurement Group, India
40
Other ways to gauge performance
• CPU speed
  – SPEC – integer and floating-point benchmarks
• Memory bandwidth
  – STREAM benchmark
Computer Measurement Group, India
41
Basic Problem
• Consider the following code
  – double a[N], b[N], c[N], d[N];
  – int i;
  – for (i = 0; i < N; i++) a[i] = b[i] + c[i]*d[i];
• If N = 10^12
• And the code has to complete in 1 second
  – How many Xeon E5-2670 CPU sockets would you need?
  – Is this memory bound or CPU bound?
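A rough back-of-the-envelope estimate (datasheet-style figures, not measurements): each iteration moves four doubles (three loads and one store, 32 bytes) and performs two floating-point operations, so N = 10^12 iterations in one second needs about 32 TB/s of memory bandwidth but only about 2 Tflop/s. A Xeon E5-2670 socket offers roughly 51.2 GB/s of memory bandwidth and about 166 Gflop/s peak double precision, so the bandwidth requirement (hundreds of sockets) dwarfs the FLOPS requirement (tens of sockets) – the loop is memory bound.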
Computer Measurement Group, India
42
General guiding principles for performance optimization
• Minimize communication requirements between parallel processes / threads
• If communication is essential
  – Hide communication delays by overlapping compute and communication
• Maximize data locality
  – Helps caching
  – Good NUMA page placement
• Do not forget to use compiler optimization flags
• Implement weighted decomposition of the workload
  – In a cluster with heterogeneous compute capabilities
Let your profiling results guide you on the next steps
Computer Measurement Group, India
43
Optimization Guidelines for GPU platforms
• Minimize the use of “if-then-else” or any other branching
  – They cause divergence
• Tune the number of threads per block
  – Too many will exhaust the caches and registers in the SM
  – Too few will under-utilize GPU capacity
• Use constant memory for constants
• Use shared memory for frequently accessed data
• Use sequential memory access instead of strided access
• Coalesce memory accesses
• Use streams to overlap compute and communication
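As an illustration of the last point, a hedged CUDA sketch: a large array is processed in chunks on two streams so that the copies of one chunk overlap the kernel of another. The kernel, sizes and chunking are made up for the example, and the host buffer must be pinned (cudaMallocHost) for the asynchronous copies to truly overlap.

/* Hedged sketch: overlapping transfers and compute with CUDA streams. */
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void process_in_chunks(float *h, float *d, int n, int nchunks)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    int chunk = n / nchunks;                       /* assume it divides evenly */
    for (int c = 0; c < nchunks; c++) {
        cudaStream_t st = s[c % 2];
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}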
Computer Measurement Group, India
44
Steps in designing parallel programs
• Partitioning (the data structure is divided into primitive tasks)
• Communication
• Agglomeration
• Mapping
Computer Measurement Group, India
45
Steps in designing parallel programs
• Partitioning
• Communication
• Agglomeration
  – Combine a sender and its receiver: eliminates communication, increases locality
  – Combine senders and receivers: reduces the number of message transmissions
• Mapping
Computer Measurement Group, India
46
Steps in designing parallel programs
• Partitioning
• Communication
• Agglomeration
• Mapping (agglomerated tasks are mapped onto NODE 1, NODE 2, NODE 3, …)
Computer Measurement Group, India
47
Agenda
• Part – I
– A sample domain problem
– Hardware & Software
• Part – II Performance Optimization Case Studies
– Online Risk Management
– Lattice Boltzmann implementation
– OpenFOAM – CFD application on Xeon Phi (if time permits)
Computer Measurement Group, India
48
Multi-core Performance Enhancement: Case Study
Computer Measurement Group, India
49
Background
• Risk Management in a commodities exchange
• Risk computed post trade
– Clearing and settlement – T+2
• Risk details updated on screen
– Alerting is controlled by human operators
Computer Measurement Group, India
50
Commodities Exchange: Online Risk Management
Trading System → Online Trades → Risk Management System → Alerts
Clearing Member with Client1 … ClientK
The risk management system can prevent a client / clearing member from trading when collateral falls short.
Inputs to the risk computation: initial deposit of collateral; long/short positions on contracts; contract/commodity price changes; risk parameters changing during the day.
Computer Measurement Group, India
51
Will standard architecture on commodity servers suffice?
Application Server
2 CPU
Database Server
2 CPU
Risk Management System?
Computer Measurement Group, India
52
Commodities Exchange: Online Risk Management
Computations:
• Position Monitoring, Mark to Market, P&L, Open Interest,
Exposure Margins
• SPAN: Initial Margin (Scanning Risk), Inter-Commodity
Spread Charge, Inter-Month Spread Charge, Short Option
Margin, Net Option Value
• Collateral Management
Functionality is complex
Let’s look at a simpler problem that reflects the same computational
challenge & come back later
Computer Measurement Group, India
53
Workload Requirements
• Trades/Day : 10 Million
• Peak Trades/Sec : 300
• Traders : 1 Million
Computer Measurement Group, India
54
P&L Computation
Trader A (current Cisco price is 970):

Time | Txn  | Stock | Quantity | Price | Total Amount
t1   | BUY  | Cisco | 100      | 950   | 95,000
t2   | BUY  | IBM   | 200      | 30    | 6,000
t3   | SELL | Cisco | 40       | 975   | 39,000
t4   | SELL | IBM   | 200      | 31    | 6,200

Profit(Cisco, t4) = -95,000 + 39,000 + (100-40)*970
                  = -56,000 + 58,200 = 2,200

In general, profit on a given stock S at time t
  = sum of txn values up to time t
    + (net position on the stock at time t) * (price of the stock at time t)
Buy txns take -ve value, sell txns +ve value.
The biggest culprit is the second term: it has to be recomputed every time the price changes.
Computer Measurement Group, India
55
P&L Computation
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];    // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];     // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] +
                                     netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop
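A direct C rendering of this pseudocode might look like the sketch below; the trade_t struct and get_next_trade() are assumed helpers, not part of the deck.

/* Sketch only: C version of the baseline P&L loop. */
#define MAXTRADERS 100000
#define MAXSTOCKS  100

typedef struct { int buyer, seller, stock, quantity, price; } trade_t;
trade_t get_next_trade(void);                    /* assumed to block for the next trade */

int profit[MAXTRADERS];                          /* array of trader profits   */
int netpositions[MAXTRADERS][MAXSTOCKS];         /* net positions per stock   */
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];          /* net transaction values    */
int profitperstock[MAXTRADERS][MAXSTOCKS];

void run(void)
{
    for (;;) {
        trade_t t = get_next_trade();
        int v = t.quantity * t.price;
        sumtxnvalue[t.buyer][t.stock]  -= v;     /* buy txns take -ve value   */
        sumtxnvalue[t.seller][t.stock] += v;     /* sell txns take +ve value  */
        netpositions[t.buyer][t.stock]  += t.quantity;
        netpositions[t.seller][t.stock] -= t.quantity;
        for (int r = 0; r < MAXTRADERS; r++) {   /* re-mark every trader      */
            profit[r] -= profitperstock[r][t.stock];
            profitperstock[r][t.stock] = sumtxnvalue[r][t.stock]
                                       + netpositions[r][t.stock] * t.price;
            profit[r] += profitperstock[r][t.stock];
        }
    }
}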
Computer Measurement Group, India
56
P&L Computational Analysis
• Profit has to be kept updated for every price change
– For all traders
• Inner Loop: 8 Computations
– 4 Computations (+ + * +)
– Loop Counter
– 3 Assignments
• Actual computational complexity
  – 20 times as complex as the displayed algorithm
• Number of traders: 1 million
Computer Measurement Group, India
57
P&L Computational Analysis
• SLA expectation: 300 trades/sec
• Computations per trade
  – 8 computations × 1 million traders × 20 = 160 million
• Computations/sec = 160 million × 300 trades/sec
  – 48 billion computations/sec!
• Out of reach of contemporary servers at the time!
Can we deliver within an IT budget?
Computer Measurement Group, India
58
Test Environment
• Server
  – 8 Xeon 5560 cores
  – 2.8 GHz
  – 8 GB RAM
• OS: CentOS 5.3
  – Linux kernel 2.6.18
• Programming language: C
• Compilers: gcc and icc
Computer Measurement Group, India
59
Test Inputs
Number of trades  : 1 million
Number of traders : 100,000
Number of stocks  : 100
Trade file size   : 20 MB

Trade distribution:
Trades % | Stock %
20%      | 30%
20%      | 60%
60%      | 10%
Computer Measurement Group, India
60
P&L Computation: Baselining
                         | Trades/sec | Overall Gain
Baseline performance gcc | 190        |
gcc –O3                  | 323        | 70%
Computer Measurement Group, India
61
P&L Computation: Transpose
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXTRADERS][MAXSTOCKS];    // net positions per stock
int sumtxnvalue[MAXTRADERS][MAXSTOCKS];     // net transaction values
int profitperstock[MAXTRADERS][MAXSTOCKS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer][t.stock]  -= t.quantity * t.price;
    sumtxnvalue[t.seller][t.stock] += t.quantity * t.price;
    netpositions[t.buyer][t.stock]  += t.quantity;
    netpositions[t.seller][t.stock] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[r][t.stock];
        profitperstock[r][t.stock] = sumtxnvalue[r][t.stock] +
                                     netpositions[r][t.stock] * t.price;
        profit[r] = profit[r] + profitperstock[r][t.stock];
    end loop
end loop

With the [Trader][Stock] layout, the inner loop for trade t reads one stock column (si) across all traders (r1, r2, r3, …) – very poor caching.
Computer Measurement Group, India
62
Matrix Layout
The Trader × Stock matrix is stored row-major: memory holds all the stocks (S1, S2, … Si) of trader r1, then all the stocks of trader r2, r3, r4, and so on.
The inner loop needs one stock column (Si) across all traders, so consecutive accesses are a full row of MAXSTOCKS elements apart.
Computer Measurement Group, India
63
Matrix Layout - Optimized
Transposed to a Stock × Trader matrix, memory holds all the traders (r1, r2, … rn) of stock S1, then all the traders of S2, S3, S4, and so on.
The inner loop over traders for the traded stock now walks contiguous memory.
Computer Measurement Group, India
64
P&L Computation: Transpose
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];    // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];     // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] +
                                     netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop

With the [Stock][Trader] layout, the inner loop for trade t on stock si walks traders r1 … ri contiguously – very good caching.
Computer Measurement Group, India
65
P&L Computation: Transpose
                          | Trades/sec | Immediate Gain | Overall Gain
Baseline performance gcc  | 190        |                |
gcc –O3                   | 323        | 1.7X           | 1.7X
Transpose of Trader/Stock | 4750       | 14.7X          | 25X

Intel compiler, icc –fast (not –O3):
                          | Trades/sec | Immediate Gain | Overall Gain
icc –fast                 | 6850       | 37%            | 36X
Computer Measurement Group, India
66
P&L Computation: Use of Partial Sums
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];    // net positions per stock
int sumtxnvalue[MAXSTOCKS][MAXTRADERS];     // net transaction values
int profitperstock[MAXSTOCKS][MAXTRADERS];

loop forever
    t = get_next_trade();
    sumtxnvalue[t.stock][t.buyer]  -= t.quantity * t.price;   // can be maintained cumulatively
    sumtxnvalue[t.stock][t.seller] += t.quantity * t.price;   // for the trader – need not be per stock
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    loop for all traders r
        profit[r] = profit[r] - profitperstock[t.stock][r];
        profitperstock[t.stock][r] = sumtxnvalue[t.stock][r] +
                                     netpositions[t.stock][r] * t.price;
        profit[r] = profit[r] + profitperstock[t.stock][r];
    end loop
end loop
Computer Measurement Group, India
67
P&L Computation: Use of Partial sums
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];    // net positions per stock
int sumtxnvalue[MAXTRADERS];                // net transaction values
int sumposvalue[MAXTRADERS];                // sum of netpositions * stock price
int ltp[MAXSTOCKS];                         // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];   // monetary value of all stock
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];   // positions at the time of trade
    loop for all traders r
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop
    ltp[t.stock] = t.price;
end loop

                    | Trades/sec | Immediate Gain | Overall Gain
Use of Partial Sums | 9650       | 41%            | 50X
Computer Measurement Group, India
68
P&L Computation: Skip Zero Values
int netpositions[MAXSTOCKS][MAXTRADERS];    // the majority of the values of this
                                            // matrix are 0, thanks to hot stocks
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    endif
end loop

                  | Trades/sec | Immediate Gain | Overall Gain
Skip Zero Values  | 10,800     | 12%            | 56X
Computer Measurement Group, India
69
P&L Computation: Cold Stocks
• There is a large percentage of cold stocks
– Those which are held by very few traders
• In the last optimization an “if” check was added to avoid
computation
– If the trader does not hold the traded stock
• Is there any benefit if the trader record is not accessed at all?
– We are computing for 100,000 traders
Computer Measurement Group, India
70
P&L Computation: Sparse Matrix Representation
Flags table – which traders own which stock? (updated in the outer loop)

Stock | A | B | C | D | E
s1    | 1 | 1 | 0 | 0 | 0
s2    | 1 | 1 | 1 | 0 | 0
s3    | 1 | 0 | 0 | 1 | 1

Trader indexes per stock (traversed in the outer loop):

Stock | Count | T0 | T1 | T2 | …
s1    | 2     | A  | B  | 0  | 0
s2    | 3     | A  | C  | B  | 0
s3    | 3     | A  | E  | D  | 0
Computer Measurement Group, India
71
P&L Computation: Sparse Matrix Representation
int profit[MAXTRADERS];                     // array of trader profits
int netpositions[MAXSTOCKS][MAXTRADERS];    // net positions per stock
int sumtxnvalue[MAXTRADERS];                // net transaction values
int sumposvalue[MAXTRADERS];                // sum of netpositions * stock price
int ltp[MAXSTOCKS];                         // latest stock price (last traded price)

loop forever
    t = get_next_trade();
    sumtxnvalue[t.buyer]  -= t.quantity * t.price;
    sumtxnvalue[t.seller] += t.quantity * t.price;
    netpositions[t.stock][t.buyer]  += t.quantity;
    netpositions[t.stock][t.seller] -= t.quantity;
    sumposvalue[t.buyer]  += t.quantity * ltp[t.stock];
    sumposvalue[t.seller] -= t.quantity * ltp[t.stock];
    loop for all traders r          // traverse the list of trader indexes instead,
                                    // when the trader count for the stock is less
                                    // than a threshold
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
    end loop

               | Trades/sec | Immediate Gain | Overall Gain
Sparse Matrix  | 36,000     | 3.24X          | 189X
Computer Measurement Group, India
72
P&L Computation: Clustering
Poor caching when three separate arrays are walked through the sparse-matrix lists:

int profit[MAXTRADERS];
int sumtxnvalue[MAXTRADERS];
int sumposvalue[MAXTRADERS];

Clustering the three fields into one record gives better caching performance:

struct TraderRecord
{
    int profit;
    int sumtxnvalue;
    int sumposvalue;
};

                      | Trades/sec | Immediate Gain | Overall Gain
Clustering of Arrays  | 70,000     | 94%            | 368X
Computer Measurement Group, India
73
P&L Computation: Precompute Price Difference
int netpositions[MAXSTOCKS][MAXTRADERS];

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        // (t.price - ltp[t.stock]) is loop-invariant: move it outside the loop
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

                       | Trades/sec | Immediate Gain | Overall Gain
Precompute Price Diff  | 75,000     | 7%             | 394X
Computer Measurement Group, India
74
P&L Computation: Loop Unrolling
#pragma unroll
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

                | Trades/sec | Immediate Gain | Overall Gain
Loop Unrolling  | 80,000     | 7%             | 421X
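Whether a bare "#pragma unroll" is honoured depends on the compiler, so the same effect can also be obtained by unrolling by hand. An illustrative 4-way unrolled sketch of the inner loop (pos points at netpositions[t.stock][0], dprice is the precomputed price difference, and the remainder loop for trader counts not divisible by 4 is omitted):

/* Sketch only: 4-way manual unroll of the mark-to-market loop. */
void mark_unrolled(int ntraders, int dprice, const int *pos,
                   const int *sumtxn, int *sumpos, int *profit)
{
    for (int r = 0; r < ntraders; r += 4) {
        if (pos[r])     { sumpos[r]     += pos[r]     * dprice; profit[r]     = sumtxn[r]     + sumpos[r];     }
        if (pos[r + 1]) { sumpos[r + 1] += pos[r + 1] * dprice; profit[r + 1] = sumtxn[r + 1] + sumpos[r + 1]; }
        if (pos[r + 2]) { sumpos[r + 2] += pos[r + 2] * dprice; profit[r + 2] = sumtxn[r + 2] + sumpos[r + 2]; }
        if (pos[r + 3]) { sumpos[r + 3] += pos[r + 3] * dprice; profit[r + 3] = sumtxn[r + 3] + sumpos[r + 3]; }
    }
}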
Computer Measurement Group, India
75
Commodities Exchange: Online Risk Management
Trading System → Online Trades → Risk Management System → Alerts
Clearing Member with Client1 … ClientK
The risk management system can prevent a client / clearing member from trading when collateral falls short.
Inputs to the risk computation: initial deposit of collateral; long/short positions on contracts; contract/commodity price changes; risk parameters changing during the day.
Computer Measurement Group, India
76
P&L Computation: Batching of Trades
Batch n trades and use the ltp of the last trade   // increases risk by a small delay

loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

                        | Trades/sec | Immediate Gain | Overall Gain
Batching of 100 trades  | 150,000    | 1.88X          | 789X
Batching of 1000 trades | 400,000    | 2.67X          | 2105X
So far all this is with only one thread!!!
Computer Measurement Group, India
77
P&L Computation: Use of Parallel Processing
#pragma omp parallel for, with chunks (32 threads on an 8-core Intel server)
loop for all traders r
    if (netpositions[t.stock][r] != 0)
        sumposvalue[r] += netpositions[t.stock][r] * (t.price - ltp[t.stock]);
        profit[r] = sumtxnvalue[r] + sumposvalue[r];
end loop

        | Trades/sec  | Immediate Gain | Overall Gain
OpenMP  | 1.2 million | 2.55X          | 5368X
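A hedged C/OpenMP rendering of that step (pos points at netpositions[t.stock][0], dprice = t.price - ltp[t.stock]; chunk size and thread count are illustrative tuning knobs, not values prescribed by the deck):

/* Sketch only: parallel mark-to-market over all traders for the traded stock. */
#include <omp.h>

void mark_parallel(int ntraders, int dprice, const int *pos,
                   const int *sumtxn, int *sumpos, int *profit)
{
    #pragma omp parallel for schedule(static, 4096) num_threads(32)
    for (int r = 0; r < ntraders; r++) {
        if (pos[r] != 0) {
            sumpos[r] += pos[r] * dprice;
            profit[r]  = sumtxn[r] + sumpos[r];
        }
    }
}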
Computer Measurement Group, India
78
P&L Computation: Summary of Optimizations
Optimization               | Trades/sec | Immediate Gain | Overall Gain
Baseline gcc               | 190        |                |
gcc –O3                    | 320        | 1.70X          | 1.7X
Transpose of Trader/Stock  | 4,750      | 14.70X         | 25X
Intel compiler icc –fast   | 6,850      | 1.37X          | 36X
Use of Partial Sums        | 9,650      | 1.41X          | 50X
Skip Zero Values           | 10,800     | 1.12X          | 56X
Sparse Matrix              | 36,000     | 3.24X          | 189X
Clustering of Arrays       | 70,000     | 1.94X          | 368X
Precompute Price Diff      | 75,000     | 1.07X          | 394X
Loop Unrolling             | 80,000     | 1.07X          | 421X
Batching of 100 Trades     | 150,000    | 1.88X          | 789X   (single thread)
Batching of 1000 Trades    | 400,000    | 2.67X          | 2105X  (single thread)
OpenMP                     | 1,020,000  | 2.55X          | 5368X  (8 CPUs, 32 threads)
Computer Measurement Group, India
79
Lattice Boltzmann on GPU
BACKGROUND
Computer Measurement Group, India
80
2-D Square Lid Driven Cavity Problem
A square cavity of side L is filled with fluid; the top lid moves with velocity U in the X direction (Y is vertical).
• Flow is generated by continuously moving the top lid at a constant velocity.
Computer Measurement Group, India
81
Level 1
Time: 520727.1 ms | MGUPS: 5.03
Remarks: Simply ported the CPU code to the GPU; structures Node & Lattice in GPU global memory.

/* CPU code */
for(y=0; y<(ny-2); y++)
{
    for(x=0; x<(nx-2); x++)
    {
        ...
    }
}

/* GPU code */
/* for(int y=0; y<(ny-2); y++) { */
if(tid < (ny-2))
{
    for(x=0; x<(nx-2); x++)
    {
        ...
    }
}

• Replace the outer loop iterations with threads.
• Total threads = (ny-2); each thread works on (nx-2) grid points.
MGUPS = (GridSize x TimeIterations) / (Time x 1000000)
Computer Measurement Group, India
82
Level 1 (Cont.)
Computer Measurement Group, India
83
Level 2
Time: 115742 ms | MGUPS: 22.65
Remarks: Loop collapsing

/* GPU code, Level 1 */
if(tid < (ny-2))
{
    for(x=0; x<(nx-2); x++)
    {
        ...
    }
}

/* GPU code with loop collapsing */
if(tid < ((ny-2)*(nx-2)))
{
    y = (tid/(nx-2))+1;
    x = (tid%(nx-2))+1;
    ...
}

• Collapsing the 2 nested loops into one exposes massive parallelism.
• Total threads = (ny-2)*(nx-2); now each thread works on 1 grid point.
Computer Measurement Group, India
84
About GPU Constant Memory
Tesla C2075: constant memory is visible to all SMs (SM 1 … SM 14) and sits alongside global memory.
• Can be used for data that will not change over the course of kernel execution.
• Define constant memory using __constant__.
• cudaMemcpyToSymbol will copy data to constant memory.
• Constant memory is cached.
• Constant memory is read-only.
• Just 64 KB.
Computer Measurement Group, India
85
Level 3
Time: 113061.8 ms | MGUPS: 23.19
Remarks: Copied the Lattice structure into GPU constant memory
typedef struct Lattice{
int Cs[9];
int Lattice_velocities[9][2];
real_dt Lattice_constants[9][4];
real_dt ek_i[9][9];
real_dt w_k[9];
real_dt ac_i[9];
real_dt gamma9[9];
}Lattice;
__constant__ Lattice lattice_dev_const[1];
cudaMemcpyToSymbol(lattice_dev_const, lattice, sizeof(Lattice));
Computer Measurement Group, India
86
Level 4
Time: 40044.5 ms | MGUPS: 65.5
Remarks: Coalesced memory access pattern for the Node structure

typedef struct Node   /* AoS, (ny*nx) elements */
{
    int     Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
}Node;

Memory layout (AoS): Type | Vel[2] | Density | F[9] | Ftmp[9] for grid point 0, then the same fields again for grid point 1, and so on.
Computer Measurement Group, India
87
Level 4 (Cont.)
Computer Measurement Group, India
88
Level 4 (Cont.)
When all threads (T-0, T-1, …) simultaneously access Density, consecutive threads hit addresses a whole Node apart – a large stride.
(AoS layout: Type | Vel[2] | Density | F[9] | Ftmp[9] per grid point.)
Computer Measurement Group, India
89
Level 4 (Cont.)
Strided access (AoS): when all threads simultaneously access Density, they read addresses a full Node apart – inefficient access of global memory.
Coalesced access pattern (SoA): all the Density values are stored contiguously, so threads T-0, T-1, … read adjacent addresses – efficient access of global memory.
Computer Measurement Group, India
90
Level 4 (Cont.)
/* AoS, (ny*nx) elements */
typedef struct Node
{
    int     Type;
    real_dt Vel[2];
    real_dt Density;
    real_dt F[9];
    real_dt Ftmp[9];
}Node;

/* SoA: each field becomes an array over all grid points */
typedef struct Type    { int     *val; }Type;
typedef struct Vel     { real_dt *val; }Vel;
typedef struct Density { real_dt *val; }Density;
typedef struct F       { real_dt *val; }F;
typedef struct Ftmp    { real_dt *val; }Ftmp;

typedef struct Node_map
{
    Type    type;
    Vel     vel[2];
    Density density;
    F       f[9];
    Ftmp    ftmp[9];
}Node_dev;
Computer Measurement Group, India
91
Level 5
Time: 14492.6 ms | MGUPS: 180.9
Remarks: Arithmetic optimizations

for(int k=3; k<SPEEDS; k++){
    //mk[k] = lattice_dev_const->gamma9[k]*mk[k];
    //mk[k] = lattice_dev_const->gamma9[k] * mk[k] / lattice_dev_const->w_k[k];
    mk[k] = lattice_dev_const->gamma9_div_wk[k]*mk[k];
}

for(int i=0; i<SPEEDS; i++){
    f_neq = 0.0;
    for(int k=0; k<SPEEDS; k++)
    {
        //f_neq += ((lattice_dev_const->ek_i[k][i] * mk[k]) / lattice_dev_const->w_k[k]);
        f_neq += lattice_dev_const->ek_i[k][i]*mk[k];
    }
}
Computer Measurement Group, India
92
Level 5 (Cont.)
Computer Measurement Group, India
93
Level 6
Time: 8309.7 ms | MGUPS: 315.5
Remarks: Algorithmic optimization

(Diagram: Collision takes Fn, vn, ρn and stores Ftmpn; after a global barrier, Streaming loads Ftmpn and produces Fn, vn, ρn for the next step.)
• Collision stores Ftmp to GPU global memory.
• Streaming loads Ftmp from GPU global memory.
• Global memory load/store operations are expensive.
Computer Measurement Group, India
94
Level 6 (Cont.)
Collision
Streaming
• Pulling Ftmp from the neighbors needs synchronization.
Computer Measurement Group, India
95
Level 6 (Cont.)
Collision
Streaming
• Instead, push Ftmp to the neighbors – no need for synchronization.
Computer Measurement Group, India
96
Level 6 (Cont.)
(Diagram: Collision and Streaming fused into one kernel operating on Fn, vn, ρn and Ftmpn.)
• Collision & Streaming can be one kernel.
• Saves one load/store from/to global memory.
Computer Measurement Group, India
97
Optimizations Achieved on GPU using CUDA

Level | Time (ms) | MGUPS (Million Grid Updates Per Second) | Remarks
1     | 520727.1  | 5.03   | Simply ported CPU code to GPU; Node & Lattice structures in GPU global memory
2     | 115742.0  | 22.65  | Loop collapsing
3     | 113061.8  | 23.19  | Copied Lattice structure into GPU constant memory
4     | 40044.5   | 65.5   | Coalesced memory access pattern for the Node structure
5     | 14492.6   | 180.9  | Arithmetic optimizations
6     | 8309.7    | 315.5  | Algorithmic optimization

• CUDA card: Tesla C2075 (448 cores, 14 SMs, Fermi, compute capability 2.0)
Computer Measurement Group, India
98
Recap
• Part – I
– A sample domain problem
– Hardware & Software
• Part – II Performance Optimization Case Studies
– Online Risk Management
– Lattice Boltzmann implementation
Computer Measurement Group, India
99
Closing Comments
• OLTP applications seldom require HPC technologies
  – Unless it is an application that needs to respond in microseconds
    • Algorithmic trading, etc.
• Can HPC technologies be used to speed up my data-transformation (ETL/ELT) and reporting workloads?
  – Sure – but you have to let go of the ease of using 3rd-party products & databases
    • If you don’t want to, customizing a specific bottleneck process could help
  – Stay tuned to companies innovating in this space
    • e.g. SQREAM – implements database operations on GPUs
• Investing in an HPC cluster and technologies is not enough
  – Also invest in people who understand
    • The underlying technologies
    • The applications
Computer Measurement Group, India
100
Q&A
m.nambiar@tcs.com
www.cmgindia.org
Computer Measurement Group, India
101