Automatic Parallelization of Simulation Code from Equation Based Simulation Languages

Peter Aronsson
Industrial PhD student, PELAB, SaS, IDA
Linköping University, Sweden
Based on the Licentiate presentation & CPC'03
Outline
• Introduction
• Task Graphs
• Related work on Scheduling & Clustering
• Parallelization Tool
• Contributions
• Results
• Conclusion & Future Work
Introduction
• Modelica
– An object-oriented, equation-based modeling language
• Modelica enables modeling and simulation of large and complex multi-domain systems
• Large need for parallel computation
– To decrease the execution time of simulations
– To make large models possible to simulate at all
– To meet hard real-time demands in hardware-in-the-loop simulations
Examples of large, complex systems in Modelica
Modelica Example – DCMotor
Modelica example
model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Modelica.Mechanics.Rotational.Inertia load(J=2.25);
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;
Example – Flat set of Equations
R1.v = -R1.n.v+R1.p.v
0 = R1.n.i+R1.p.i
R1.i = R1.p.i
R1.i*R1.R = R1.v
I1.v = -I1.n.v+I1.p.v
0 = I1.n.i+I1.p.i
I1.i = I1.p.i
I1.L*I1.der(i) = I1.v
emf.v =-emf.n.v+emf.p.v
0 = emf.n.i+emf.p.i
emf.i = emf.p.i
emf.w = emf.flange_b.der(phi)
emf.k*emf.w = emf.v
emf.flange_b.tau = -emf.i*emf.k
ground.p.v = 0
step.v = -step.n.v+step.p.v
0 = step.n.i+step.p.i
step.i = step.p.i
step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1]
then 0
else step.signalSource.p_height[1])+step.signalSource.p_offset[1]
step.v = step.signalSource.outPort.signal[1]
load.flange_a.phi = load.phi
load.flange_b.phi = load.phi
load.w = load.der(phi)
load.a = load.der(w)
load.a*load.J = load.flange_a.tau+load.flange_b.tau
R1.n.v = I1.p.v
I1.p.i+R1.n.i = 0
I1.n.v = emf.p.v
emf.p.i+I1.n.i = 0
emf.n.v = step.n.v
step.n.v = ground.p.v
emf.n.i+ground.p.i+step.n.i = 0
emf.flange_b.phi = load.flange_a.phi
emf.flange_b.tau+load.flange_a.tau = 0
step.p.v = R1.p.v
R1.p.i+step.p.i = 0
load.flange_b.tau = 0
step.signalSource.y = step.signalSource.outPort.signal
Plot of Simulation Result
[Figure: simulation result; load.flange_a.tau and load.w plotted over time]
Task Graphs
• Directed Acyclic Graph (DAG): G = (V, E, t, c)
– V: set of nodes, representing computational tasks
– E: set of edges, representing communication of data between tasks
– t(v): execution cost of node v
– c(i,j): communication cost of edge (i,j)
• Referred to as the delay model (macro dataflow model)
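As a minimal sketch of how such a graph can be represented (an illustrative adjacency-list structure, not the tool's actual data types):

#include <stdlib.h>

/* Delay-model task graph: nodes carry execution costs t(v), edges carry
   communication costs c(i,j). Names are illustrative only. */
typedef struct Edge {
    int target;          /* index of the task at the other end */
    double c;            /* communication cost c(i,j) */
    struct Edge *next;
} Edge;

typedef struct {
    int n;               /* number of tasks, |V| */
    double *t;           /* t[v]: execution cost of task v */
    Edge **succ;         /* succ[v]: outgoing edges of v */
} TaskGraph;

TaskGraph *tg_new(int n) {
    TaskGraph *g = malloc(sizeof *g);
    g->n = n;
    g->t = calloc(n, sizeof *g->t);
    g->succ = calloc(n, sizeof *g->succ);
    return g;
}

void tg_add_edge(TaskGraph *g, int i, int j, double c) {
    Edge *e = malloc(sizeof *e);
    e->target = j;
    e->c = c;
    e->next = g->succ[i];
    g->succ[i] = e;
}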
Small Task Graph Example
[Figure: an eight-node task graph with execution costs on the nodes and communication costs on the edges]
Task Scheduling Algorithms
• Multiprocessor Scheduling Problem
– For each task, assign
• a starting time
• a processor assignment (P1, ..., PN)
– Goal: minimize execution time, given
• precedence constraints
• execution costs
• communication costs
• Algorithms in the literature
– List scheduling approaches (ERT, FLB); a minimal sketch follows below
– Critical-path scheduling approaches (TDS, MCP)
• Categories: fixed number of processors, fixed c and/or t, ...
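A minimal list-scheduling sketch in this spirit, assuming the TaskGraph type from the earlier sketch plus predecessor edge lists (pred[v], with target pointing at the predecessor). For brevity it picks the first ready task each round; real algorithms such as ERT rank ready tasks by a priority order.

#include <float.h>
#include <stdlib.h>

/* Greedy list scheduling: place each ready task on the processor that
   gives it the earliest start time, charging c(i,j) only when producer
   and consumer land on different processors. */
void list_schedule(const TaskGraph *g, Edge **pred, int P,
                   int *proc, double *start) {
    double *finish = calloc(g->n, sizeof *finish);
    double *avail  = calloc(P, sizeof *avail);   /* processor free times */
    int *todo = malloc(g->n * sizeof *todo), left = g->n;
    for (int v = 0; v < g->n; v++) { todo[v] = v; proc[v] = -1; }

    while (left > 0) {
        int pick = 0;                  /* find a ready task */
        for (int i = 0; i < left; i++) {
            int v = todo[i], ready = 1;
            for (Edge *e = pred[v]; e; e = e->next)
                if (proc[e->target] < 0) { ready = 0; break; }
            if (ready) { pick = i; break; }
        }
        int v = todo[pick];
        todo[pick] = todo[--left];

        double best = DBL_MAX; int bestp = 0;
        for (int p = 0; p < P; p++) {  /* earliest start over processors */
            double est = avail[p];
            for (Edge *e = pred[v]; e; e = e->next) {
                double arrive = finish[e->target]
                              + (proc[e->target] == p ? 0.0 : e->c);
                if (arrive > est) est = arrive;
            }
            if (est < best) { best = est; bestp = p; }
        }
        proc[v] = bestp;
        start[v] = best;
        finish[v] = best + g->t[v];
        avail[bestp] = finish[v];
    }
    free(finish); free(avail); free(todo);
}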
Granularity
• Granularity g = min(t(v)) / max(c(i,j)) (computed in the sketch below)
• Affects the scheduling result
– E.g. TDS works best for high values of g, i.e. low communication cost
• Solutions:
– Clustering algorithms
• Idea: build clusters of nodes, where nodes in the same cluster are executed on the same processor
– Merging algorithms
• Merge tasks to increase computational cost
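A small helper computing g under this definition (a sketch over the TaskGraph type from the earlier sketch; assumes at least one node):

/* Granularity g = min(t(v)) / max(c(i,j)); returns min(t) unchanged
   when the graph has no edges. */
double granularity(const TaskGraph *g) {
    double tmin = g->t[0], cmax = 0.0;
    for (int v = 0; v < g->n; v++) {
        if (g->t[v] < tmin) tmin = g->t[v];
        for (Edge *e = g->succ[v]; e; e = e->next)
            if (e->c > cmax) cmax = e->c;
    }
    return cmax > 0.0 ? tmin / cmax : tmin;
}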
Task Clustering/Merging Algorithms
• Task Clustering Problem
– Build clusters of nodes such that the parallel time decreases
– PT(n) = tlevel(n) + blevel(n) (see the blevel sketch below)
– Achieved by zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost
• Literature:
– Sarkar's internalization alg., Yang's DSC alg.
• Task Merging Problem
– Transform the task graph by merging nodes
• Literature: e.g. the Grain Packing alg.
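Here blevel(n) is the longest-path cost from n down to a sink (including t(n)), and tlevel(n) the longest-path cost from a source to n (excluding t(n)). A memoized sketch of blevel over the TaskGraph type from the earlier sketch:

/* blevel(v): t(v) plus the most expensive downward path, counting edge
   costs. memo must be initialized to -1.0 for all nodes. The parallel
   time is then the maximum over v of tlevel(v) + blevel(v). */
double blevel(const TaskGraph *g, int v, double *memo) {
    if (memo[v] >= 0.0) return memo[v];
    double best = 0.0;
    for (Edge *e = g->succ[v]; e; e = e->next) {
        double via = e->c + blevel(g, e->target, memo);
        if (via > best) best = via;
    }
    return memo[v] = g->t[v] + best;
}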
Clustering vs. Merging
[Figure: the example task graph shown after clustering and after merging. In the clustered task graph, nodes keep their identity but edges inside a cluster get zero communication cost; in the merged task graph, groups such as {2,5,6} and {3,6} are combined into single tasks]
DSC Algorithm
1. Initially, put each node in a separate cluster.
2. Traverse the task graph, merging clusters as long as the parallel time does not increase.
• Low complexity: O((n + e) log n)
• Previously used by Andersson in ObjectMath (PELAB)
Modelica Compilation
[Figure: compilation pipeline. The Modelica model (.mo) is flattened according to the Modelica semantics into flat Modelica (.mof), translated into an equation system (DAE), optimized, and emitted as C code containing the rhs calculations, which is driven by a numerical solver]
Structure of the simulation code:
for (t = 0; t < stopTime; t += stepSize) {
  x_dot[t+1] = f(x_dot[t], x[t], t);
  x[t+1] = ODESolver(x_dot[t+1]);
}
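As a concrete, hypothetical instance of this loop structure (explicit Euler for a scalar ODE; the real generated code handles a full DAE system and a proper solver):

#include <stdio.h>

/* Example rhs, x' = f(x, t); purely illustrative. */
static double f(double x, double t) { return -x + t; }

int main(void) {
    double x = 1.0;                    /* initial state */
    double stopTime = 2.0, stepSize = 0.01;
    for (double t = 0.0; t < stopTime; t += stepSize) {
        double x_dot = f(x, t);        /* rhs calculation */
        x += stepSize * x_dot;         /* solver step (explicit Euler) */
    }
    printf("x(%g) = %g\n", stopTime, x);
    return 0;
}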
Optimizations on Equations
• Simplification of equations
– E.g. a = b, b = c => eliminate b
• BLT transformation, i.e. topological sorting of the equations into strongly connected components (BLT = Block Lower Triangular form)
[Figure: a block lower triangular incidence matrix]
• Index reduction. The index is the number of times an equation needs to be differentiated in order to solve the equation system.
• Mixed Mode / Inline Integration: methods of optimizing the equations by reducing the size of the equation systems
Generated C Code Content
• Assignment statements
• Arithmetic expressions (+, -, *, /), if-expressions
• Function calls
– Standard math functions
• sin, cos, log
– Modelica functions
• User-defined, side-effect free
– External Modelica functions
• In an external library, written in Fortran or C
– Calls to functions for solving subsystems of equations
• Linear or non-linear
• Example application
– A robot simulation with 27,000 lines of generated C code
Parallelization Tool Overview
[Figure: tool chain. The Modelica compiler translates the model (.mo) into C code; a C compiler links it against a solver lib into the sequential executable, while the Parallelizer transforms it into parallel C code that a C compiler links against an MPI lib into the parallel executable]
Parallelization Tool Internal Structure
[Figure: pipeline from sequential C code through Parser, Task Graph Builder, Scheduler, and Code Generator to parallel C code, with a shared Symbol Table and a Debug & Statistics module]
Task Graph Building
• First graph: corresponds to individual arithmetic operations, assignments, function calls, and variable definitions in the C code
• Second graph: clusters of tasks from the first task graph
Example:
[Figure: an expression-level task graph over variable definitions a, b, c, d with +, -, *, / operations and a call to foo, next to its clustered counterpart with merged nodes such as {+,-,*} and {/,-}]
Investigated Scheduling Algorithms
• Parallelization tool
– TDS (Task Duplication Scheduling algorithm)
– Pre-clustering method
– Full Task Duplication method
• Experimental framework (Mathematica)
– ERT
– DSC
– TDS
– Full Task Duplication method
– Task merging approaches (graph rewrite systems)
Method 1: Pre-Clustering Algorithm
• buildCluster(n: node, l: list of nodes, size: Integer)
– Adds n to a new cluster
– Repeatedly adds nodes until size(cluster) = size, taken in order from (see the sketch after this list):
– Children of n
– Children of the cluster with in-degree one
– Siblings of n
– Parents of n
– Arbitrary nodes
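A simplified skeleton of this growth step, assuming the TaskGraph type from the earlier sketch. It implements only child-first growth; the full algorithm falls back to one-in-degree children, siblings, parents, and arbitrary nodes, in that order.

/* Grow a cluster around node n, breadth-first over children, until it
   holds `size` nodes or runs out of reachable candidates. `taken` marks
   nodes already assigned to some cluster. Returns the cluster size. */
int build_cluster(const TaskGraph *g, int n, int size,
                  int *cluster, char *taken) {
    int count = 0, head = 0;
    cluster[count++] = n;
    taken[n] = 1;
    while (head < count && count < size) {
        int v = cluster[head++];
        for (Edge *e = g->succ[v]; e && count < size; e = e->next)
            if (!taken[e->target]) {       /* add a child of v */
                taken[e->target] = 1;
                cluster[count++] = e->target;
            }
    }
    return count;   /* may be smaller than `size` */
}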
Managing Cycles
• When adding a node to a cluster, the resulting graph might contain cycles
• The graph that results from clustering a and b is cyclic, since {a,b} can be reached from c
• The resulting graph is then not a DAG
– Standard scheduling algorithms can no longer be used
[Figure: a five-node graph (a, b, c, d, e) in which merging a and b introduces a cycle through c]
Pre-Clustering Results
• Did not produce speedup
– Introduced far too many dependencies in the resulting task graph
– Sequentialized the schedule
• Conclusion:
– For fine-grained task graphs, such an algorithm needs task duplication to succeed
Method 2: Full Task Duplication
• For each node n with successors(n) = {}
– Put all pred(n) in one cluster
• Repeat for all nodes in the cluster (see the sketch below)
– Rationale: if the depth of the graph is limited, task duplication is kept at a reasonable level and clusters stay reasonably small
– Works well when communication cost >> execution cost
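The core of this method is collecting, for each sink, its full set of transitive predecessors; a recursive sketch, assuming predecessor edge lists alongside the TaskGraph type from the earlier sketch. Duplication arises because ancestors shared between sinks end up copied into several clusters.

/* Mark v and all of its transitive predecessors as members of the
   current cluster. pred[v] lists incoming edges; e->target is the
   predecessor node. */
void collect_preds(int v, Edge **pred, char *in_cluster) {
    if (in_cluster[v]) return;
    in_cluster[v] = 1;
    for (Edge *e = pred[v]; e; e = e->next)
        collect_preds(e->target, pred, in_cluster);
}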
Full Task Duplication (2)
• Merging clusters
1. Merge clusters using a load-balancing strategy, without increasing the maximum cluster size
2. Merge the clusters with the greatest number of common nodes
• Repeat (2) until the required number of processors is reached
Full Task Duplication Results
• Computed measurements
– Execution cost of the largest cluster + communication cost
• Measured speedup
– Executed on a PC Linux cluster with an SCI network interface, using SCAMPI
Robot Example: Computed Speedup
• Mixed Mode / Inline Integration
[Figure: two panels of computed speedup vs. number of processors (1 to 9) for communication costs c = 1000, 100, and 10; left panel without MM/II, right panel with MM/II]
Thermofluid Pipe Executed on a PC Cluster
• Pressurewavedemo in the Thermofluid package, 50 discretization points
[Figure: measured speedup vs. number of processors (1, 2, 4, 8, 16)]
Thermofluid Pipe Executed on a PC Cluster
• Pressurewavedemo in the Thermofluid package, 100 discretization points
[Figure: measured speedup vs. number of processors (1, 2, 4, 8, 16)]
Task Merging Using GRS
• Idea: a set of simple rules that transform a task graph to increase its granularity (and decrease the parallel time)
• Use top level (and bottom level) as the metric:
• Parallel Time = max over nodes n of tlevel(n) + blevel(n)
Rule 1
• Merge a single child with its only parent.
[Figure: parent p and its single child c merged into one node p']
• Motivation: the merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase.
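One reading of this rule as an applicability test (a sketch over the TaskGraph type from the earlier sketch, with npred[v] holding in-degrees; the actual rule set is defined as graph rewrite patterns):

/* Rule 1 pattern: c is the only child of p and p is the only parent of
   c, so merging c into p serializes nothing that was not already serial. */
int rule1_applicable(const TaskGraph *g, int p, int c, const int *npred) {
    return g->succ[p] != NULL
        && g->succ[p]->next == NULL      /* p has exactly one child */
        && g->succ[p]->target == c
        && npred[c] == 1;                /* c has exactly one parent */
}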
Rule 2
• Merge all parents of a node together with the node itself.
[Figure: parents p1, p2, ..., pn and the child c merged into one node c']
• Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.
Rule 3
• Duplicate the parent and merge a copy into each child node.
[Figure: parent p with children c1, c2, ..., cn rewritten into merged nodes c1', c2', ..., cn']
• Motivation: as long as no child's tlevel increases, duplicating p into each child reduces the number of nodes and increases granularity.
Rule 4
• Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded.
[Figure: parents p1, ..., pn of a child c; p1, ..., pk are merged into p', leaving pk+1, ..., pn]
• Motivation: this rule is useful when several small predecessor nodes exist alongside a larger predecessor node that prevents a complete merge. It does not guarantee a decrease in PT.
Results – Example
• Task graph from Modelica simulation code
– A small example from the mechanical domain
– About 100 nodes built on the expression level, originating from 84 equations and variables
Result – Task Merging Example
• B = 1, L = 1
[Figure: merged task graph for bandwidth B = 1, latency L = 1]
Result – Task Merging Example (2)
• B = 1, L = 10
• B = 1, L = 100
[Figure: merged task graphs for latencies L = 10 and L = 100]
Conclusions
• The pre-clustering approach did not work well for the fine-grained task graphs produced by our parallelization tool
• The FTD method
– Works reasonably well for some examples
• However, in general:
– Better scheduling/clustering algorithms are needed for fine-grained task graphs
Conclusions (2)
• The simple delay model may not be enough
– More advanced models require more complex scheduling and clustering algorithms
• Simulation code from equation-based models
– Is hard to extract parallelism from
– New optimization methods on DAEs or ODEs are needed to increase parallelism
Conclusions – Task Merging Using GRS
• A task merging algorithm using GRS has been proposed
– Four rules with simple patterns => fast pattern matching
• Can easily be integrated into existing scheduling tools
• Successfully merges tasks, considering
– Bandwidth & latency
– Task duplication
– Merging criterion: decrease the parallel time (PT) by decreasing tlevel
• Tested on examples from simulation code
Future Work
• Design and implement better scheduling and clustering algorithms
– Support for more advanced task graph models
– That work better on fine-grained task graphs, i.e. low granularity values
• Try larger examples
• Test on different architectures
– Shared-memory machines
– Dual-processor machines
Future Work (2)
• Heterogeneous multiprocessor systems
– Mixed DSP processors, RISC, CISC, etc.
• Enhance the Modelica language with data parallelism
– E.g. parallel loops, vector operations
• Parallelize e.g. combined PDE and ODE problems in Modelica
• Use e.g. ScaLAPACK for solving subsystems of linear equations. How can it be integrated into the scheduling algorithms?