Performance Measurements of CCR and
MPI on Multicore Systems
Expanded from a Poster at Grid 2007 Austin Texas
September 21 2007
Xiaohong Qiu
Research Computing UITS, Indiana University Bloomington IN
Geoffrey Fox, H. Yuan, Seung-Hee Bae
Community Grids Laboratory, Indiana University Bloomington IN 47404
George Chrysanthakopoulos, Henrik Frystyk Nielsen
Microsoft Research, Redmond WA
Presented by Geoffrey Fox gcf@indiana.edu
http://www.infomall.org
Motivation
• Exploring possible applications for tomorrow’s
multicore chips (especially clients) with 64 or
more cores (about 5 years)
• One plausible set of applications is data-mining
of Internet and local sensors
• Developing a library of efficient data-mining
algorithms
– Clustering (GIS, Cheminformatics) and Hidden
Markov Methods (Speech Recognition)
• Choose algorithms that can be parallelized well
Approach
• Need 3 forms of parallelism
– MPI Style
– Dynamic threads as in pruned search
– Coarse Grain functional parallelism
• Do not use an integrated language approach as in
DARPA HPCS
• Rather use “mash-ups” or “workflow” to link
together modules in optimized parallel libraries
• Use Microsoft CCR/DSS, where DSS is the mashup/workflow model built from CCR, and CCR
supports MPI-style messaging or dynamic threads
Microsoft CCR
• Supports exchange of messages between threads using named
ports
• FromHandler: Spawn threads without reading ports
• Receive: Each handler reads one item from a single port
• MultipleItemReceive: Each handler reads a prescribed number of
items of a given type from a given port. Note items in a port can
be general structures but all must have same type.
• MultiplePortReceive: Each handler reads one item of a given
type from multiple ports.
• JoinedReceive: Each handler reads one item from each of two
ports. The items can be of different type.
• Choice: Execute a choice of two or more port-handler pairings
• Interleave: Consists of a set of arbiters (port-handler pairs) of 3
types: Concurrent, Exclusive, or Teardown (called at the end
for clean-up). Concurrent arbiters are run concurrently, but
Exclusive handlers are not.
• http://msdn.microsoft.com/robotics/
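To make the primitives above concrete, here is a minimal sketch of the simplest one, Receive, assuming the Microsoft.Ccr.Core namespace from the CCR runtime; the dispatcher, queue, and port names are our own and the workload is a placeholder, not the benchmark code.

```csharp
// Minimal CCR "Receive" sketch (assumes Microsoft.Ccr.Core from the CCR/DSS runtime).
using System;
using Microsoft.Ccr.Core;

class ReceiveExample
{
    static void Main()
    {
        // A dispatcher with one thread per core and a queue that feeds it.
        using (var dispatcher = new Dispatcher(0, "demo pool"))
        using (var queue = new DispatcherQueue("demo queue", dispatcher))
        {
            var port = new Port<int>();

            // Receive: the handler fires once for a single item read from the port.
            Arbiter.Activate(queue,
                Arbiter.Receive(false, port, item =>
                    Console.WriteLine("got {0}", item)));

            port.Post(42);      // message exchange between threads via the named port
            Console.ReadLine(); // keep the process alive until the handler has run
        }
    }
}
```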
Preliminary Results
• Parallel Deterministic Annealing Clustering in
C# with a speed-up of 7 on Intel systems with
two quad-core chips
• Analysis of performance of Java, C, C# in
MPI and dynamic threading with XP, Vista,
Windows Server, Fedora, Redhat on
Intel/AMD systems
• Study of cache effects coming with MPI
thread-based parallelism
• Study of execution time fluctuations in
Windows (limiting speed-up to 7 not 8!)
Machines Used
AMD4: HP xw9300 workstation, 2 AMD Opteron 275 CPUs at 2.19GHz, 4 cores
L2 Cache 4x1MB (summing both chips), Memory 4GB,
XP Pro 64bit, Windows Server, Red Hat
C# Benchmark Computational unit: 1.388 µs
Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80GHz, 4 cores
L2 Cache 4x2MB, Memory 4GB,
XP Pro 64bit
C# Benchmark Computational unit: 1.475 µs
Intel8a: Dell Precision PWS690, 2 Intel Xeon CPUs E5320 at 1.86GHz, 8 cores
L2 Cache 4x4M, Memory 8GB,
XP Pro 64bit
C# Benchmark Computational unit: 1.696 µs
Intel8b: Dell Precision PWS690, 2 Intel Xeon CPUs E5355 at 2.66GHz, 8 cores
L2 Cache 4x4M, Memory 4GB,
Vista Ultimate 64bit, Fedora 7
C# Benchmark Computational unit: 1.188 µs
Intel8c: Dell Precision PWS690, 2 Intel Xeon CPUs E5345 at 2.33GHz, 8 cores
L2 Cache 4x4M, Memory 8GB,
Red Hat 5.0, Fedora 7
Basic Performance of CCR
CCR overhead for a computation of 27.76 µs between messaging, AMD4: 4 cores
[Table: overhead in µs vs. number of parallel computations (1, 2, 3, 4, 7, 8) for the Spawned patterns Pipeline, Shift, and Two Shifts, and the Rendezvous (MPI-style) patterns Pipeline, Shift, Exchange as Two Shifts, and Exchange.]
CCR overhead for a computation of 29.5 µs between messaging, Intel4: 4 cores
[Table: overhead in µs vs. number of parallel computations (1, 2, 3, 4, 7, 8) for the same Spawned and Rendezvous (MPI-style) patterns as above.]
CCR overhead for a computation of 23.76 µs between messaging, Intel8b: 8 cores
[Table: overhead in µs vs. number of parallel computations (1, 2, 3, 4, 7, 8) for the same Spawned and Rendezvous (MPI-style) patterns as above.]
[Plot: time in microseconds (0 to 30) vs. stages in millions (0 to 10) for AMD Exchange, AMD Exchange as 2 Shifts, and AMD Shift.]
Overhead (latency) of the AMD4 PC with 4 execution threads on MPI-style Rendezvous messaging, for Shift and for Exchange implemented either as two shifts or as a custom CCR pattern.
[Plot: time in microseconds (0 to 70) vs. stages in millions (0 to 10) for Intel Exchange, Intel Exchange as 2 Shifts, and Intel Shift.]
Overhead (latency) of the Intel8b PC with 8 execution threads on MPI-style Rendezvous messaging, for Shift and for Exchange implemented either as two shifts or as a custom CCR pattern.
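The "Exchange as two Shifts" pattern measured above can be sketched with CCR ports as below; this is our own hedged illustration combining Post with the JoinedReceive primitive from the earlier slide, not the measured benchmark code, and the port and rank names are assumptions.

```csharp
// Hedged sketch of the rendezvous "Exchange as two Shifts" pattern using CCR ports.
using System;
using Microsoft.Ccr.Core;

class ExchangeAsTwoShifts
{
    const int P = 4; // number of parallel "ranks" (one per core)

    static void Main()
    {
        var fromLeft = MakePorts();   // port i receives the value shifted in from rank i-1
        var fromRight = MakePorts();  // port i receives the value shifted in from rank i+1

        using (var dispatcher = new Dispatcher(P, "exchange pool"))
        using (var queue = new DispatcherQueue("exchange queue", dispatcher))
        {
            for (int i = 0; i < P; i++)
            {
                int me = i;
                // JoinedReceive: the handler runs once both neighbour messages have arrived.
                Arbiter.Activate(queue, Arbiter.JoinedReceive(false, fromLeft[me], fromRight[me],
                    (double l, double r) =>
                        Console.WriteLine("rank {0} exchanged {1} and {2}", me, l, r)));
            }

            for (int i = 0; i < P; i++)
            {
                double value = i;                        // stand-in for the 20-30 µs computation stage
                fromLeft[(i + 1) % P].Post(value);       // shift to the right neighbour
                fromRight[(i + P - 1) % P].Post(value);  // shift to the left neighbour
            }

            Console.ReadLine(); // keep the process alive until all handlers have fired
        }
    }

    static Port<double>[] MakePorts()
    {
        var ports = new Port<double>[P];
        for (int i = 0; i < P; i++) ports[i] = new Port<double>();
        return ports;
    }
}
```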
Basic Performance of MPI for C and Java
MPI Exchange Latency in µs with 500,000 stages (20-30 µs computation between messaging)

Machine        OS       Runtime        Grains    Parallelism   MPI Exchange Latency (µs)
Intel8c:gf12   Redhat   MPJE           Process   8             181
               Redhat   MPICH2         Process   8             40.0
               Redhat   MPICH2: Fast   Process   8             39.3
               Redhat   Nemesis        Process   8             4.21
Intel8c:gf20   Fedora   MPJE           Process   8             157
               Fedora   mpiJava        Process   8             111
               Fedora   MPICH2         Process   8             64.2
Intel8b        Vista    MPJE           Process   8             170
               Fedora   MPJE           Process   8             142
               Fedora   mpiJava        Process   8             100
               Vista    CCR            Thread    8             20.2
AMD4           XP       MPJE           Process   4             185
               Redhat   MPJE           Process   4             152
               Redhat   mpiJava        Process   4             99.4
               Redhat   MPICH2         Process   4             39.3
               XP       CCR            Thread    4             16.3
Intel4         XP       CCR            Thread    4             25.8
MPI Shift Latency on AMD4 (MPICH2, mpiJava, MPJE)
[Plot: shift overhead in µs (0 to 120) vs. stages (0 to 10 million) for Windows XP (MPJE), Red Hat (MPJE), Red Hat (mpiJava), and Red Hat (MPICH2).]
MPI Exchange Latency on AMD4 (MPICH2, mpiJava, MPJE)
[Plot: exchange overhead in µs (0 to 250) vs. stages (0 to 10 million) for Windows XP (MPJE), Red Hat (MPJE), Red Hat (mpiJava), and Red Hat (MPICH2).]
MPI Exchange Latency on Intel8c gf12 (Red Hat; MPICH2, Nemesis, MPJE)
[Plot: exchange overhead in µs (0 to 250) vs. stages (0 to 10 million) for MPJE, MPICH2, MPICH2:Nemesis, and MPICH2:enable-fast.]
Cache Line Interference
• Early implementations of our clustering algorithm showed large fluctuations due to the cache line interference effect discussed here and on the next slide in a simple case
• We have one thread on each core, each calculating a sum of the same complexity and storing its result in a common array A, with different cores using different array locations
• Thread i stores its sum in A(i): separation 1 gives no variable-access interference but does give cache line interference
• Thread i stores its sum in A(X*i): separation X
• Serious degradation if X < 8 (64 bytes) with Windows
– Note A is an array of doubles (8 bytes each)
– Less interference effect with Linux, especially Red Hat
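As a concrete illustration of the experiment just described, the sketch below (our own C# illustration, not the original benchmark; the names and toy workload are assumptions) runs one thread per core, each accumulating into A[separation * i], and times the loop for separations of 1, 4, 8, and 1024 doubles. Separations below 8 put several threads' accumulators on the same 64-byte cache line.

```csharp
// Hedged sketch of the cache-line interference measurement described above.
using System;
using System.Diagnostics;
using System.Threading;

class CacheLineInterference
{
    static void Run(int threads, int separation, int iterations)
    {
        // Extra slack at the end so every slot A[separation * i] is in range.
        var A = new double[separation * threads + 64];
        var workers = new Thread[threads];
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < threads; i++)
        {
            int slot = separation * i;
            workers[i] = new Thread(() =>
            {
                for (int k = 1; k <= iterations; k++)
                    A[slot] += 1.0 / k;   // repeated writes into the shared array
            });
            workers[i].Start();
        }
        foreach (var w in workers) w.Join();
        sw.Stop();
        // Absolute times are machine-dependent; the point is the ratio between separations.
        Console.WriteLine("separation {0,4}: {1} ms", separation, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        foreach (int separation in new[] { 1, 4, 8, 1024 })   // units of 8-byte doubles
            Run(Environment.ProcessorCount, separation, 10000000);
    }
}
```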
Cache Line Interference
Time in µs versus thread array separation (unit: 8 bytes); Mean and Std/Mean shown for separations 1, 4, 8, and 1024

Machine  OS       Run Time   1: Mean  Std/Mean   4: Mean  Std/Mean   8: Mean  Std/Mean   1024: Mean  Std/Mean
Intel8b  Vista    C# CCR     8.03     .029       3.04     .059       0.884    .0051      0.884       .0069
Intel8b  Vista    C# Locks   13.0     .0095      3.08     .0028      0.883    .0043      0.883       .0036
Intel8b  Vista    C          13.4     .0047      1.69     .0026      0.66     .029       0.659       .0057
Intel8b  Fedora   C          1.50     .01        0.69     .21        0.307    .0045      0.307       .016
Intel8a  XP       C# CCR     10.6     .033       4.16     .041       1.27     .051       1.43        .049
Intel8a  XP       C# Locks   16.6     .016       4.31     .0067      1.27     .066       1.27        .054
Intel8a  XP       C          16.9     .0016      2.27     .0042      0.946    .056       0.946       .058
Intel8c  Red Hat  C          0.441    .0035      0.423    .0031      0.423    .0030      0.423       .032
AMD4     WinSrvr  C# CCR     8.58     .0080      2.62     .081       0.839    .0031      0.838       .0031
AMD4     WinSrvr  C# Locks   8.72     .0036      2.42     .01        0.836    .0016      0.836       .0013
AMD4     WinSrvr  C          5.65     .020       2.69     .0060      1.05     .0013      1.05        .0014
AMD4     XP       C# CCR     8.05     .010       2.84     .077       0.84     .040       0.840       .022
AMD4     XP       C# Locks   8.21     .006       2.57     .016       0.84     .007       0.84        .007
AMD4     XP       C          6.10     .026       2.95     .017       1.05     .019       1.05        .017
• Note measurements at a separation of 8 (and values between 8 and 1024, not shown) are essentially identical
• Measurements at 7 (not shown) are higher than those at 8 (except for Red Hat, which shows essentially no enhancement at X < 8)
• If the effects are due to co-location of thread variables in a 64-byte cache line, the array must be aligned with cache boundaries
– In early implementations we found the poor X = 8 performance expected when words of A are split across cache lines
Clustering Problem
Deterministic Annealing
• See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998
• Parallelization is similar to ordinary K-Means as we are
calculating global sums which are decomposed into local
averages and then summed over components calculated in
each processor
• Many similar data-mining algorithms (such as annealed versions of E-M expectation maximization) have high parallel efficiency and avoid local minima
• For more details see
– http://grids.ucs.indiana.edu/ptliupages/presentations/Grid2007PosterSept19-07.ppt and
– http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/PC07BYOPA.ppt
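To make the parallelization pattern above concrete, here is a hedged C# sketch of the sum decomposition: each thread accumulates per-cluster partial sums over its own block of points, and the partial sums are then combined into the global sums used to update the cluster centres. This is our own illustration, not the library code; Weight() is a placeholder for the annealed cluster responsibilities and the data are synthetic.

```csharp
// Hedged sketch of the per-thread partial sums plus global reduction used to parallelise clustering.
using System;
using System.Threading;

class ParallelSums
{
    static double Weight(double x, double centre) =>
        Math.Exp(-(x - centre) * (x - centre));   // placeholder for the annealed responsibility

    static void Main()
    {
        int threads = Environment.ProcessorCount, clusters = 3;
        var points = new double[800000];
        var rng = new Random(1);
        for (int i = 0; i < points.Length; i++) points[i] = rng.NextDouble();
        double[] centres = { 0.2, 0.5, 0.8 };

        // One private accumulator block per thread avoids write sharing during the loop.
        var localSum = new double[threads][];
        var localCount = new double[threads][];
        var workers = new Thread[threads];
        int block = points.Length / threads;   // any remainder points are ignored in this sketch

        for (int t = 0; t < threads; t++)
        {
            int me = t;
            workers[t] = new Thread(() =>
            {
                localSum[me] = new double[clusters];
                localCount[me] = new double[clusters];
                for (int i = me * block; i < (me + 1) * block; i++)
                    for (int c = 0; c < clusters; c++)
                    {
                        double w = Weight(points[i], centres[c]);
                        localSum[me][c] += w * points[i];
                        localCount[me][c] += w;
                    }
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        // Global reduction: sum the per-thread contributions, then update the centres.
        for (int c = 0; c < clusters; c++)
        {
            double sum = 0, count = 0;
            for (int t = 0; t < threads; t++) { sum += localSum[t][c]; count += localCount[t][c]; }
            Console.WriteLine("cluster {0}: new centre {1:F4}", c, sum / count);
        }
    }
}
```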
Parallel Multicore
Deterministic Annealing Clustering
Parallel Overhead on 8 Threads, Intel8b
[Plot: parallel overhead (0 to 0.45) vs. 10000/(grain size n = points per core), from 0 to 4, for 10 clusters and 20 clusters.]
Overhead = Constant1 + Constant2/n
Speedup = 8/(1 + Overhead)
Constant1 = 0.05 to 0.1 (client Windows) due to thread runtime fluctuations
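Written out in our notation (f is the parallel overhead, n the grain size in points per core, and c1, c2 the constants called Constant1 and Constant2 above), the model quoted in the plot is:

```latex
f(n) = c_1 + \frac{c_2}{n}, \qquad
S(n) = \frac{8}{1 + f(n)}, \qquad
c_1 \approx 0.05\text{--}0.1 \quad \text{(client Windows)}
```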
Parallel Multicore
Deterministic Annealing Clustering
Parallel Overhead for large (2M points) Indiana Census clustering on 8 Threads, Intel8b
[Plot: parallel overhead (0 to 0.25) vs. number of clusters (0 to 35).]
The fluctuating overhead ("Constant1") is due to 5-10% runtime fluctuations between threads
Increasing the number of clusters decreases communication/memory bandwidth overheads
Scaled Speed up Tests
• The full clustering algorithm involves different values of the
number of clusters NC as computation progresses
• The amount of computation per data point is proportional to NC
and so overhead due to memory bandwidth (cache misses)
declines as NC increases
• We did a set of tests on the clustering kernel with fixed NC
• Further we adopted the scaled speed-up approach looking at
the performance as a function of number of parallel threads
with constant number of data points assigned to each thread
– This contrasts with fixed problem size scenario where the number of data
points per thread is inversely proportional to number of threads
• We plot the run time for the same workload per thread divided by the number of data points multiplied by the number of clusters multiplied by the time at the smallest data set (10,000 data points per thread)
• Expect this normalized run time to be independent of the number of threads if not for parallel and memory bandwidth overheads
– It will decrease as NC increases, as the number of computations per point fetched from memory increases proportionally to NC
Intel 8b C with 1 Cluster: Vista Scaled
Run Time for Clustering Kernel
• Note the smallest dataset has the highest overheads as we increase the number of threads
– Not clear why this is
[Plot: scaled run time (0.9 to 1.3) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
Intel 8b C with 80 Clusters: Vista
Scaled Run Time for Clustering Kernel
• As we increase the number of clusters, the effects at 10,000 data points decrease
[Plot: scaled run time (0.8 to 0.9) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
Intel 8b C# with 1 Cluster: Vista Scaled
Run Time for Clustering Kernel
• C# is similar to C with larger effects
[Plot: scaled run time (0.95 to 1.6) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
Intel 8b C# with 1 Cluster: Vista Run Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points (ratio of standard deviation to run time vs. number of threads)
[Plot: standard deviation/run time (0 to 0.2) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
Intel 8b C# with 80 Clusters: Vista
Scaled Run Time for Clustering Kernel
• C# is similar to C with larger effects
[Plot: scaled run time (0.8 to 1.0) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
AMD4 C with 1 Cluster: XP Scaled Run
Time for Clustering Kernel
• This is significantly more stable than the Intel runs and shows little or no memory bandwidth effect
[Plot: scaled run time (1.0 to 1.06) vs. number of threads (1 to 4) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
AMD4 C# with 1 Cluster: XP Scaled
Run Time for Clustering Kernel
• This is significantly more stable than the Intel C# 1-cluster runs
[Plot: scaled run time (0.95 to 1.1) vs. number of threads (1 to 4) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
AMD4 C# with 80 Clusters: XP Scaled
Run Time for Clustering Kernel
• This is broadly similar to the 80-cluster Intel C# runs, unlike the one-cluster case, which was very different
[Plot: scaled run time (0.75 to 0.85) vs. number of threads (1 to 4) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
AMD4 C# with 1 Cluster: Windows Server
Scaled Run Time for Clustering Kernel
• This is significantly more stable than the Intel C# runs
[Plot: scaled run time (0.9 to 1.05) vs. number of threads (1 to 4) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
AMD4 C# with 80 Clusters: Windows Server
Scaled Run Time for Clustering Kernel
• Curiously, run time decreases a bit as the number of threads increases in some AMD4 scenarios
[Plot: scaled run time (0.75 to 0.81) vs. number of threads (1 to 4) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
Intel 8c C with 1 Cluster: Red Hat
Scaled Run Time for Clustering Kernel
• Deviations from "perfect" scaled speed-up are much less for Red Hat than for Windows
[Plot: scaled run time (1.0 to 1.15) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
Intel 8c C with 80 Clusters: Red Hat
Scaled Run Time for Clustering Kernel
• Deviations from "perfect" scaled speed-up are much less for Red Hat
[Plot: scaled run time (0.98 to 1.0) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
Intel 8b C# with 80 Clusters: Vista Run
Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points (ratio of standard deviation to run time vs. number of threads)
[Plot: standard deviation/run time (0 to 0.1) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
AMD4 with 1 Cluster: Windows Server Run
Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the threads between messaging synchronization points (ratio of standard deviation to run time vs. number of threads)
• XP (not shown) is similar
[Plot: standard deviation/run time (0 to 0.2) vs. number of threads (1 to 4) for 10,000, 50,000, and 500,000 data points, 1 cluster.]
Intel 8c with 80 Clusters: Redhat Run
Time Fluctuations for Clustering Kernel
• This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points (ratio of standard deviation to run time vs. number of threads)
[Plot: standard deviation/run time (0 to 0.006) vs. number of threads (1 to 8) for 10,000, 50,000, and 500,000 data points, 80 clusters.]
DSS Section
• We view the system as a collection of services, in this case:
– One to supply data
– One to run the parallel clustering
– One to visualize results, in this case by spawning a Google Maps browser
– Note we are clustering Indiana census data
• DSS is convenient as it is built on CCR
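A hedged, CCR-level sketch of that three-service pipeline is shown below. It is not the DSS service code (DSS itself adds service contracts and HTTP-based messaging on top of CCR); the stage names, message types, and placeholder clustering are our own illustration of how the stages connect through ports.

```csharp
// Hedged CCR-level sketch of the data-supply -> clustering -> visualization pipeline.
using System;
using Microsoft.Ccr.Core;

class ServicePipelineSketch
{
    static void Main()
    {
        using (var dispatcher = new Dispatcher(0, "dss sketch"))
        using (var queue = new DispatcherQueue("dss sketch queue", dispatcher))
        {
            var rawData = new Port<double[]>();    // output of the data-supply "service"
            var clusters = new Port<double[]>();   // output of the clustering "service"

            // Clustering stage: consume one data block, emit (placeholder) cluster centres.
            Arbiter.Activate(queue, Arbiter.Receive(true, rawData, (double[] points) =>
            {
                var centres = new[] { points[0], points[points.Length - 1] }; // stand-in result
                clusters.Post(centres);
            }));

            // Visualization stage: in the real system this would drive a Google Maps browser.
            Arbiter.Activate(queue, Arbiter.Receive(true, clusters, (double[] centres) =>
                Console.WriteLine("would plot {0} cluster centres", centres.Length)));

            // Data-supply stage: post one block of (here synthetic) census-like data.
            rawData.Post(new double[] { 1.0, 2.0, 3.0, 4.0 });
            Console.ReadLine();
        }
    }
}
```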
DSS Service Measurements
[Plot: average run time in microseconds (0 to 350) vs. number of round trips (1 to 10,000).]
Timing of the HP Opteron multicore machine as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release)
Measurements of Axis 2 show about 500 microseconds; DSS is 10 times better
The clustering algorithm anneals by decreasing the distance scale and gradually finds more clusters as the resolution improves.
Here we see the number of clusters increasing from 10 to 30 as the algorithm progresses.