Performance of a Multi-Paradigm Messaging Runtime on Multicore Systems
Poster at Grid 2007
Omni Austin Downtown Hotel, Austin, Texas
September 19, 2007
Xiaohong Qiu
Research Computing UITS, Indiana University Bloomington IN
Geoffrey Fox, H. Yuan, Seung-Hee Bae
Community Grids Laboratory, Indiana University Bloomington IN 47404
George Chrysanthakopoulos, Henrik Frystyk Nielsen
Microsoft Research, Redmond WA
Presented by Geoffrey Fox gcf@indiana.edu
http://www.infomall.org
Motivation
• Exploring possible applications for tomorrow’s
multicore chips (especially clients) with 64 or
more cores, expected in roughly 5 years
• One plausible set of applications is data-mining
of Internet and local sensors
• Developing a library of efficient data-mining
algorithms
– Clustering (GIS, Cheminformatics) and Hidden
Markov Methods (Speech Recognition)
• Choose algorithms that can be parallelized well
Approach
• Need 3 forms of parallelism
– MPI Style
– Dynamic threads as in pruned search
– Coarse-grain functional parallelism
• Do not use an integrated language approach as in
DARPA HPCS
• Rather use “mash-ups” or “workflow” to link
together modules in optimized parallel libraries
• Use Microsoft CCR/DSS, where DSS is a mash-up
model built on CCR, and CCR supports MPI-style
or dynamic threads
Microsoft CCR
• Supports exchange of messages between threads using named
ports
• FromHandler: Spawn threads without reading ports
• Receive: Each handler reads one item from a single port
• MultipleItemReceive: Each handler reads a prescribed number of
items of a given type from a given port. Note items in a port can
be general structures but all must have the same type.
• MultiplePortReceive: Each handler reads one item of a given
type from multiple ports.
• JoinedReceive: Each handler reads one item from each of two
ports. The items can be of different types.
• Choice: Execute a choice of two or more port-handler pairings
• Interleave: Consists of a set of arbiters (port-handler pairs) of 3
types that are Concurrent, Exclusive or Teardown (called at end
for clean up). Concurrent arbiters are run concurrently, but
Exclusive handlers are serialized, running one at a time. (A
minimal usage sketch follows this list.)
• http://msdn.microsoft.com/robotics/
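For concreteness, a minimal CCR fragment is sketched below (illustration only, not the benchmark code used for the measurements that follow): it creates a Dispatcher thread pool, posts integers to a Port, and activates a persistent single-item Receive arbiter. Dispatcher, DispatcherQueue, Port and Arbiter.Receive are standard CCR types from Microsoft.Ccr.Core; the port name, handler and message count are invented for this example.

```csharp
// Illustration only (not the benchmark code used for the measurements below).
using System;
using Microsoft.Ccr.Core;   // CCR runtime shipped with Microsoft Robotics Studio

class CcrReceiveSketch
{
    static void Main()
    {
        // Thread pool that runs handlers; 0 threads means "one per core".
        using (Dispatcher dispatcher = new Dispatcher(0, "demo pool"))
        {
            DispatcherQueue queue = new DispatcherQueue("demo queue", dispatcher);
            Port<int> port = new Port<int>();   // named port carrying int messages

            // Receive: the handler reads one item from a single port.
            // 'true' makes the receiver persistent so it fires for every posted item.
            Arbiter.Activate(queue,
                Arbiter.Receive(true, port,
                    delegate(int item) { Console.WriteLine("handled " + item); }));

            for (int i = 0; i < 4; i++)
                port.Post(i);                   // messages exchanged between threads via the port

            Console.ReadLine();                 // keep the process alive while handlers run
        }
    }
}
```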
Preliminary Results
• Parallel Deterministic Annealing Clustering in
C# with speed-up of 7 on an Intel dual quad-core
(8-core) system
• Analysis of performance of Java, C, C# in
MPI and dynamic threading with XP, Vista,
Windows Server, Fedora, Redhat on
Intel/AMD systems
• Study of cache effects that come with MPI-style
thread-based parallelism
• Study of execution time fluctuations in
Windows (limiting speed-up to 7, not 8!)
Machines Used
AMD4: HP xw9300 workstation, 2 AMD Opteron 275 CPUs at 2.19 GHz, 4 cores
  L2 cache 4x1 MB (summing both chips), memory 4 GB
  XP Pro 64-bit, Windows Server, Red Hat
  C# benchmark computational unit: 1.388 µs
Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80 GHz, 4 cores
  L2 cache 4x2 MB, memory 4 GB
  XP Pro 64-bit
  C# benchmark computational unit: 1.475 µs
Intel8a: Dell Precision PWS690, 2 Intel Xeon E5320 CPUs at 1.86 GHz, 8 cores
  L2 cache 4x4 MB, memory 8 GB
  XP Pro 64-bit
  C# benchmark computational unit: 1.696 µs
Intel8b: Dell Precision PWS690, 2 Intel Xeon E5355 CPUs at 2.66 GHz, 8 cores
  L2 cache 4x4 MB, memory 4 GB
  Vista Ultimate 64-bit, Fedora 7
  C# benchmark computational unit: 1.188 µs
Intel8c: Dell Precision PWS690, 2 Intel Xeon E5345 CPUs at 2.33 GHz, 8 cores
  L2 cache 4x4 MB, memory 8 GB
  Red Hat 5.0, Fedora 7
CCR Overhead (µs) for a computation of 27.76 µs between messaging
AMD4: 4 cores

                                        Number of Parallel Computations
                                         1      2      3      4      7      8
Spawned      Pipeline                  1.76   4.52   4.4    4.84   1.42   8.54
             Shift                        -   4.48   4.62   4.8    0.84   8.94
             Two Shifts                   -   7.44   8.9   10.18  12.74  23.92
Rendezvous   Pipeline                  3.7    5.88   6.8    6.52   8.42   6.74
(MPI)        Shift                        -   9.36   8.54   2.74  14.98  11.16
             Exchange As Two Shifts       -  14.1   15.9   19.14  11.78  22.6
             Exchange                     -  10.32  15.5   16.3   11.3   21.38
CCR Overhead (µs) for a computation of 29.5 µs between messaging
Intel4: 4 cores

                                        Number of Parallel Computations
                                         1      2      3      4      7      8
Spawned      Pipeline                  3.32   8.3    9.38  10.18   3.02  12.12
             Shift                        -   8.3    9.34  10.08   4.38  13.52
             Two Shifts                   -  17.64  19.32  21     28.74  44.02
Rendezvous   Pipeline                  9.36  12.08  13.02  13.58  16.68  25.68
(MPI)        Shift                        -  12.56  13.7   14.4    4.72  15.94
             Exchange As Two Shifts       -  23.76  27.48  30.64  22.14  36.16
             Exchange                     -  18.48  24.02  25.76  20     34.56
CCR Overhead (µs) for a computation of 23.76 µs between messaging
Intel8b: 8 cores

                                        Number of Parallel Computations
                                         1      2      3      4      7      8
Spawned      Pipeline                  1.58   2.44   3      2.94   4.5    5.06
             Shift                        -   2.42   3.2    3.38   5.26   5.14
             Two Shifts                   -   4.94   5.9    6.84  14.32  19.44
Rendezvous   Pipeline                     -   3.96   4.52   5.78   6.82   7.18
(MPI)        Shift                        -   4.46   6.42   5.86  10.86  11.74
             Exchange As Two Shifts       -   7.4   11.64  14.16  31.86  35.62
             Exchange                     -   6.94  11.22  13.3   18.78  20.16
MPI Exchange Latency in µs with 500,000 stages (20-30 µs computation between messaging)

Machine        OS       Runtime        Grains    Parallelism   MPI Exchange Latency (µs)
Intel8c:gf12   Redhat   MPJE           Process   8             181
                        MPICH2         Process   8             40.0
                        MPICH2: Fast   Process   8             39.3
                        Nemesis        Process   8             4.21
Intel8c:gf20   Fedora   MPJE           Process   8             157
                        mpiJava        Process   8             111
                        MPICH2         Process   8             64.2
Intel8b        Vista    MPJE           Process   8             170
               Fedora   MPJE           Process   8             142
               Fedora   mpiJava        Process   8             100
               Vista    CCR            Thread    8             20.2
AMD4           XP       MPJE           Process   4             185
               Redhat   MPJE           Process   4             152
               Redhat   mpiJava        Process   4             99.4
               Redhat   MPICH2         Process   4             39.3
               XP       CCR            Thread    4             16.3
Intel4         XP       CCR            Thread    4             25.8
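The latencies above are per-stage overheads: each of the 500,000 stages does 20-30 µs of computation and then a rendezvous exchange, so the reported number is roughly (total elapsed time / stages) minus the per-stage compute time. The sketch below emulates that measurement loop with plain .NET threads and a Barrier standing in for the CCR/MPI exchange; it illustrates the methodology only, and the thread count, stage count and compute kernel are placeholders, not the benchmark code.

```csharp
// Simplified emulation of the benchmark methodology (not the measured CCR/MPI code).
using System;
using System.Diagnostics;
using System.Threading;

class ExchangeLatencySketch
{
    const int ThreadCount = 8;        // "Parallelism" column in the table above
    const int Stages = 500000;        // same stage count as the benchmark
    static readonly Barrier Rendezvous = new Barrier(ThreadCount);
    static double Sink;               // keeps the compute loop from being optimized away

    static void Worker()
    {
        double x = 0;
        for (int stage = 0; stage < Stages; stage++)
        {
            for (int i = 0; i < 2000; i++)
                x += Math.Sqrt(i + x);        // stand-in for the 20-30 µs computation
            Rendezvous.SignalAndWait();       // stands in for the rendezvous exchange
        }
        Sink = x;
    }

    static void Main()
    {
        Thread[] workers = new Thread[ThreadCount];
        Stopwatch watch = Stopwatch.StartNew();
        for (int t = 0; t < ThreadCount; t++)
        {
            workers[t] = new Thread(Worker);
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
        watch.Stop();

        double perStageMicroseconds = watch.Elapsed.TotalMilliseconds * 1000.0 / Stages;
        Console.WriteLine("per-stage time = {0:F2} microseconds (subtract compute time for overhead)",
                          perStageMicroseconds);
    }
}
```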
[Figure: Overhead (latency) of the AMD4 PC with 4 execution threads on MPI-style Rendezvous
messaging, for Shift and for Exchange implemented either as two shifts or as a custom CCR pattern.
Y-axis: time in microseconds (0-30); X-axis: stages in millions (0-10).
Curves: AMD Exch, AMD Exch as 2 Shifts, AMD Shift.]
[Figure: Overhead (latency) of the Intel8b PC with 8 execution threads on MPI-style Rendezvous
messaging, for Shift and for Exchange implemented either as two shifts or as a custom CCR pattern.
Y-axis: time in microseconds (0-70); X-axis: stages in millions (0-10).
Curves: Intel Exch, Intel Exch as 2 Shifts, Intel Shift.]
[Figure: MPI Exchange Latency on AMD4 (MPICH2, mpiJava, MPJE) – exchange overhead on the
dual-CPU AMD machine. Y-axis: latency in microseconds (0-250); X-axis: stages in millions (0-10).
Curves: WindowsXP (MPJE), RedHat (MPJE), RedHat (mpiJava), RedHat (MPICH2).]
Cache Line Interference

Time in µs versus thread array separation (unit is 8 bytes); each cell gives Mean and Std/Mean

Machine   OS       Run Time   Separation 1      Separation 4      Separation 8      Separation 1024
                              Mean    Std/Mean  Mean    Std/Mean  Mean    Std/Mean  Mean    Std/Mean
Intel8b   Vista    C# CCR     8.03    .029      3.04    .059      0.884   .0051     0.884   .0069
Intel8b   Vista    C# Locks   13.0    .0095     3.08    .0028     0.883   .0043     0.883   .0036
Intel8b   Vista    C          13.4    .0047     1.69    .0026     0.66    .029      0.659   .0057
Intel8b   Fedora   C          1.50    .01       0.69    .21       0.307   .0045     0.307   .016
Intel8a   XP       C# CCR     10.6    .033      4.16    .041      1.27    .051      1.43    .049
Intel8a   XP       C# Locks   16.6    .016      4.31    .0067     1.27    .066      1.27    .054
Intel8a   XP       C          16.9    .0016     2.27    .0042     0.946   .056      0.946   .058
Intel8c   Redhat   C          0.441   .0035     0.423   .0031     0.423   .0030     0.423   .032
AMD4      WinSrvr  C# CCR     8.58    .0080     2.62    .081      0.839   .0031     0.838   .0031
AMD4      WinSrvr  C# Locks   8.72    .0036     2.42    .01       0.836   .0016     0.836   .0013
AMD4      WinSrvr  C          5.65    .020      2.69    .0060     1.05    .0013     1.05    .0014
• One thread runs on each core
• Thread i stores its sum in A(i): this is separation 1 – there is no variable-access interference
between threads, but there is cache line interference
• Thread i stores its sum in A(X*i): this is separation X
• A is an array of doubles (8 bytes per element)
• Serious degradation occurs when the separation is less than 64 bytes (8 elements) under Vista or XP
(a sketch of the experiment follows this list)
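The fragment below sketches the experiment just described (illustrative only, not the measured benchmark; thread count, iteration count and timing harness are placeholders): each thread accumulates into its own slot of a shared double array, and the slot spacing decides whether different threads' slots land in the same 64-byte cache line.

```csharp
// Sketch of the cache-line interference experiment above (illustrative, not the measured code).
// Thread i accumulates into A[i * separation]; with separation < 8 doubles (64 bytes),
// several threads share one cache line and their writes interfere.
using System;
using System.Diagnostics;
using System.Threading;

class CacheLineSketch
{
    static void Run(int threadCount, int separation, long iterations)
    {
        double[] A = new double[threadCount * separation + 1];
        Thread[] workers = new Thread[threadCount];
        Stopwatch watch = Stopwatch.StartNew();

        for (int t = 0; t < threadCount; t++)
        {
            int slot = t * separation;          // "thread array separation" in 8-byte units
            workers[t] = new Thread(delegate()
            {
                for (long i = 0; i < iterations; i++)
                    A[slot] += 1.0 / (i + 1);   // each thread keeps its running sum in its own slot
            });
            workers[t].Start();
        }
        foreach (Thread w in workers) w.Join();
        watch.Stop();

        Console.WriteLine("separation {0,5}: {1} ms", separation, watch.ElapsedMilliseconds);
    }

    static void Main()
    {
        int[] separations = { 1, 4, 8, 1024 };  // the separations used in the table above
        foreach (int sep in separations)
            Run(8, sep, 10000000);
    }
}
```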
Deterministic Annealing
• See K. Rose, "Deterministic Annealing for
Clustering, Compression, Classification,
Regression, and Related Optimization
Problems," Proceedings of the IEEE, vol. 86, pp.
2210-2239, November 1998
• Parallelization is similar to ordinary K-Means, as
we are calculating global sums that are
decomposed into local averages and then
summed over the components calculated on each
processor (a decomposition sketch follows below)
• Many similar data-mining algorithms (such as
Expectation Maximization with annealing)
have high parallel efficiency and avoid
local minima
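As a sketch of that decomposition (not the production clustering code; Parallel.For and the synthetic data are stand-ins introduced for this example): each worker forms the local sum for its block of points, and the per-thread partial results are then combined into the global sum used to update the cluster centres.

```csharp
// Illustrative sketch of the local-sum / global-sum decomposition described above.
using System;
using System.Threading.Tasks;

class PartialSumSketch
{
    static void Main()
    {
        int nPoints = 1000000, nThreads = 8;
        double[] x = new double[nPoints];
        Random rng = new Random(1);
        for (int i = 0; i < nPoints; i++) x[i] = rng.NextDouble();

        double[] localSum = new double[nThreads];
        int block = nPoints / nThreads;

        Parallel.For(0, nThreads, t =>
        {
            double s = 0;
            int end = (t == nThreads - 1) ? nPoints : (t + 1) * block;
            for (int i = t * block; i < end; i++)
                s += x[i];                   // purely local accumulation, no shared writes
            localSum[t] = s;                 // one write per thread into its own slot
        });

        double globalSum = 0;
        foreach (double s in localSum) globalSum += s;   // cheap serial reduction
        Console.WriteLine("global mean = " + globalSum / nPoints);
    }
}
```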
Clustering by Deterministic Annealing
• Use a physics analogy for clustering
  Deterministically find cluster centers y_j using the "mean field
  approximation" – one could instead use the slower Monte Carlo approach
  Annealing avoids local minima (the standard update equations are sketched below)
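For reference, the mean-field (deterministic annealing) update from the Rose paper cited above takes the standard form below, where d(x, y_j) is the (squared) distance from point x to centre y_j and T is the temperature that is lowered during annealing; this is a sketch of the textbook equations rather than a transcription from the poster.

```latex
p(j \mid x) = \frac{\exp\!\left(-d(x, y_j)/T\right)}{\sum_k \exp\!\left(-d(x, y_k)/T\right)},
\qquad
y_j = \frac{\sum_x p(j \mid x)\, x}{\sum_x p(j \mid x)}
```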
Parallel Multicore Deterministic Annealing Clustering

[Figure: Parallel overhead on 8 threads of Intel8b versus 10000/(grain size n = points per core),
for 10 clusters and 20 clusters. Y-axis: parallel overhead (0-0.45); X-axis: 0-4.
Annotations: Speedup = 8/(1 + Overhead); Overhead = Constant1 + Constant2/n;
Constant1 = 0.05 to 0.1 (client Windows).]
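A quick check of the model annotated in the figure, assuming Constant1 ≈ 0.1 (the client-Windows value quoted above) and a grain size n large enough that the Constant2/n term is negligible, reproduces the "speed-up of 7, not 8" observation from the results summary:

```latex
\text{Speedup} = \frac{8}{1 + \text{Overhead}}, \qquad
\text{Overhead} = C_1 + \frac{C_2}{n}, \qquad
C_1 \approx 0.1 \;\Rightarrow\; \text{Speedup} \approx \frac{8}{1.1} \approx 7.3
```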
Parallel Multicore Deterministic Annealing Clustering

[Figure: Parallel overhead ("Constant1") for large (2M points) Indiana Census clustering on
8 threads of Intel8b, versus number of clusters (0-30). Y-axis: overhead (0-0.25).
Increasing the number of clusters decreases communication/memory-bandwidth overheads.]
Intel 8b C# with 1 Cluster: Vista Scaled
Run Time for Clustering Kernel
• Run time for the same workload per thread, normalized by the number of data points
• Expect run time to be independent of the number of threads, were it not for parallel and
memory bandwidth overheads
• Work per data point is proportional to the number of clusters
[Figure: Scaled run time (seconds) versus number of threads (0-8) for 10,000, 50,000 and
500,000 data points; 1 cluster, C# on Intel8b under Vista. Y-axis range roughly 10.5-17 seconds.]
Intel 8b C# with 80 Clusters: Vista
Scaled Run Time for Clustering Kernel
• Work per data point is proportional to the number of clusters,
so memory bandwidth and parallel overheads
decrease as the number of clusters increases
[Figure: Scaled run time (seconds) versus number of threads (0-8) for 10,000, 50,000 and
500,000 data points; 80 clusters, C# on Intel8b under Vista. Y-axis range roughly 8-11 seconds.]
Intel 8c C with 80 Clusters: Redhat Run
Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the
8 threads between messaging synchronization points
[Figure: Standard deviation / run time versus number of threads (0-8) for 10,000, 50,000 and
500,000 data points; 80 clusters, C on Intel8c under Redhat. Fluctuations stay below about 0.02.]
Intel 8c C with 80 Clusters: Redhat
Scaled Run Time for Clustering Kernel
• Work per data point is proportional to the number of clusters,
so memory bandwidth and parallel overheads
decrease as the number of clusters increases
[Figure: Scaled run time (seconds) versus number of threads (0-8) for 10,000, 50,000 and
500,000 data points; 80 clusters, C on Intel8c under Redhat. Y-axis range roughly 9.1-9.3 seconds.]
Intel 8b C# with 1 Cluster: Vista Run
Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the
8 threads between messaging synchronization points
[Figure: Standard deviation / run time versus number of threads (0-8) for 10,000, 50,000 and
500,000 data points; 1 cluster, C# on Intel8b under Vista. Fluctuations reach roughly 0.1-0.2.]
Intel 8b C# with 80 Clusters: Vista Run
Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the
8 threads between messaging synchronization points
[Figure: Standard deviation / run time versus number of threads (0-8) for 10,000, 50,000 and
500,000 data points; 80 clusters, C# on Intel8b under Vista. Fluctuations reach roughly 0.05-0.1.]
DSS Section
• We view the system as a collection of services – in this case:
– One to supply data
– One to run the parallel clustering
– One to visualize results – in this case by spawning
a Google Maps browser
– Note we are clustering Indiana census data
• DSS is convenient as it is built on CCR
DSS Service Measurements

[Figure: Average run time (microseconds) on the HP Opteron multicore machine as a function of
the number of simultaneous round trips (two-way service messages processed), from 1 to 10,000
(November 2006 DSS release). Y-axis: 0-350 µs.]

CGL measurements of Axis 2 show about 500 microseconds – DSS is roughly 10 times better
The clustering algorithm anneals by decreasing the distance scale and gradually finds more
clusters as the resolution improves.
Here we see the number of clusters increasing to 30 as the algorithm progresses.