
Message-based MVC
and
High Performance Multi-core Runtime
Xiaohong Qiu
xqiu@indiana.edu
December 21, 2006
Session Outline
My Brief Background
Education and Work Experiences
Ph.D. Thesis Research
Message-based MVC Architecture for
Distributed and Desktop Applications
Recent Research Project
High Performance Multi-core Runtime
My Brief Background I
1987 ─ 1991 Computer Science program at Beihang
University
CS was viewed as a promising field to get into at the time
Four years of foundation courses, computer hardware & software
courses, labs, projects, and an internship. Programming languages
used included assembly language, Basic, Pascal, Fortran 77,
Prolog, Lisp, and C. Programming environments included DOS, Unix,
Windows, and Macintosh.
1995 ─ 1998 Computer Science graduate program at
Beihang University
Graduate Research Assistant at National Lab of Software
Development Environment
Participated in a team project SNOW (shared memory network of
workstations), working on an improved algorithm for a parallel I/O
subsystem based on the two-phase method and MPI I/O.
1991 ─ 1998 Faculty at Beihang University
Assistant Lecturer & Lecturer, teaching Database and Introduction
to Computing courses.
My Brief Background II
1998 ─ 2000 M.S., Computer Information Science program at
Syracuse University
2000 ─ 2005 Ph.D., Computer Information Science program at
Syracuse University
The thesis project involved surveying, designing, and evaluating a new
paradigm for the next generation of rich-media software applications, one that
unifies legacy desktop and Internet applications with automatic
collaboration and universal access capabilities. Attended conferences to
present research papers and exhibit projects
Awarded the Syracuse University Fellowship from 1998 to 2001 and named
Outstanding Graduate Student of the College of Electrical Engineering
and Computer Science in 2005
May 2005 ─ present Visiting Researcher at Community Grids Lab,
Indiana University
June ─ November 2006 Software Project Lead at Anabas Inc.
Analysis of Concurrency and Coordination Runtime (CCR) and Dynamic
Secure Services (DSS) for Parallel and Distributed Computing
Message-based MVC (M-MVC)
Research Background
Architecture of Message-based MVC
Collaboration Paradigms
SVG Experiments
Performance Analysis
Summary of Thesis Research
Research Background
Motivations
CPU speed (Moore’s law) and network bandwidth (Gilder’s law) continue to
improve, bringing fundamental changes
Internet and Web technologies have evolved into a global information infrastructure
for sharing of resources
Applications getting increasingly sophisticated
Internet collaboration enabling virtual enterprises
Large-scale distributed computing
These require a new application architecture that adapts to fast technology change,
with properties such as simplicity, reusability, scalability, reliability, and
performance
General area is technology support for Synchronous and
Asynchronous Resource Sharing
e-learning (e.g. video/audio conferencing)
e-science (e.g. large-scale distributed computing)
e-business (e.g. virtual organizations)
e-entertainment (e.g. online games)
Research on a generic model for building applications across application domains:
Distributed (Web): Service Oriented Architecture and Web Services
Desktop (Client): Model-View-Controller (MVC) paradigm
Internet collaboration: Hierarchical Web Service pipeline model
Architecture of Message-based MVC
A comparison of MVC, Web Service Pipeline, and Message-based MVC
Figure: decomposition of an SVG browser shown three ways.
a. MVC Model: a Controller coordinating the Semantic Model and the View.
b. Three-stage pipeline: Web Service Model, High Level UI (View), and Raw UI (Display) linked through input and output ports, with events and rendering flowing as messages that contain control information.
c. Message-based MVC: the same decomposition with all linkage carried by events as messages (toward the Model) and rendering as messages (toward the Display).
Features of Message-based MVC Paradigm
M-MVC is a general approach for building applications with a message-based
paradigm
It emphasizes a universal modularized service model with messaging linkage (a sketch follows at the end of this list)
Converges desktop applications, Web applications, and Internet collaboration
MVC and Web Services are fundamental architectures for desktop and Web
applications
Web Service pipeline model provides the general collaboration architecture for
distributed applications
M-MVC is a uniform architecture integrating the above models
M-MVC allows automatic collaboration, which simplifies the architecture design
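To make the messaging linkage concrete, here is a minimal sketch of the "events as messages / rendering as messages" idea, assuming a hypothetical publish/subscribe Bus facade. In the thesis this role is played by NaradaBrokering; the Bus API and topic names below are illustrative assumptions, not the thesis code.

    import java.util.function.Consumer;

    // Hypothetical broker facade; in the thesis work NaradaBrokering plays this role.
    interface Bus {
        void publish(String topic, Object message);
        void subscribe(String topic, Consumer<Object> handler);
    }

    // View: captures raw UI events, publishes them, and renders whatever the Model sends back.
    class View {
        private final Bus bus;
        View(Bus bus) {
            this.bus = bus;
            bus.subscribe("mmvc/rendering", update -> render(update));
        }
        void onMouseEvent(Object uiEvent) { bus.publish("mmvc/events", uiEvent); }
        void render(Object update) { /* apply the rendering update to the display */ }
    }

    // Model: applies semantic/DOM changes and publishes rendering updates as messages.
    class Model {
        Model(Bus bus) {
            bus.subscribe("mmvc/events", event -> {
                Object update = applyEvent(event);      // mutate DOM / semantic state
                bus.publish("mmvc/rendering", update);  // rendering as a message
            });
        }
        Object applyEvent(Object event) { return event; } // placeholder semantics
    }

Because both halves only talk through the broker, adding another View (collaboration) is just another subscriber on the same topics.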
Collaboration Paradigms I
SMMV vs. MMMV as MVC interactive patterns
Figure: a) Single Model Multiple View: one Model serves Views 1 through n. b) Multiple Model Multiple View: Models 1 through m, each with its own Views 1 through n.
Flynn’s Taxonomy classifies parallel computing platforms in four types:
SISD, MISD, SIMD, and MIMD.
SIMD: A single control unit dispatches instructions to each processing unit.
MIMD: Each processor is capable of executing a different program
independent of the other processors. It enables asynchronous processing.
SMMV generalizes the concept of SIMD
MMMV generalizes the concept of MIMD
In practice, SMMV and MMMV patterns can be applied in both asynchronous
and synchronous applications, thus forming general collaboration paradigms
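One way to read SMMV and MMMV in the publish/subscribe setting sketched earlier; the topic layout here is an illustrative assumption, not the thesis conventions:

    class CollaborationTopics {
        // SMMV: a single shared Model publishes rendering updates to one topic;
        // every participant's View subscribes to it (e.g. instructor-led learning).
        static final String SMMV_RENDERING = "session/model/rendering";

        // MMMV: each participant runs its own Model and topic; a View subscribes
        // to the Model it chooses to follow (e.g. participatory learning).
        static String mmmvRendering(String participant) {
            return "session/" + participant + "/model/rendering";
        }
    }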
Collaboration Paradigms II
Monolithic collaboration
CGL collaborative applications for PowerPoint, OpenOffice, and data visualization
Figure: monolithic collaboration through NaradaBrokering. Identical SVG browser programs on the master client and the other clients receive identical events.
Collaboration paradigms deployed with M-MVC model
SMMV (e.g. Instructor led learning)
MMMV (e.g. Participatory learning)
Figure: collaboration paradigms deployed with the M-MVC model. SMMV: a single Model as a Web Service, connected through NaradaBrokering brokers to the Views on the master client and the other clients. MMMV: each client (master and others) pairs its own Model as a Web Service with its View, again linked through NaradaBrokering brokers.
SVG Experiments I
Monolithic SVG Experiments
Collaborative SVG Browser
Collaborative SVG chess game with players and observers
SVG Experiments II
Decomposed the SVG browser into stages of a pipeline
Figure: the decomposed SVG browser. The View (Client) on Machine A handles input (UI events) and output (rendering) with mirrored DOM and GVT trees and their event processors; the Model (Service) on Machine C holds the DOM tree (before and after mutation), JavaScript, and its event processors; a NaradaBrokering broker on Machine B provides the notification service carrying events as messages. Timestamps T0 through T4 mark points along this path, defined below.
T0: The time a given user event, such as a mouse click, is sent from the View to the
Model.
T1: A given user event such as a mouse click can generate multiple
associated DOM change events transmitted from the Model to the
View. T1 is the arrival time at the View of the first of these.
T2: This is the arrival of the last of these events from the Model and
the start of the processing of the set of events in the GVT tree.
T3: This is the start of the rendering stage.
T4: This is the end of the rendering stage.
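A minimal sketch of how these timestamps can be recorded on the View side; the class and field names are illustrative assumptions, not the Batik/M-MVC code:

    // One record per user event; each hook stores System.currentTimeMillis().
    class EventTiming {
        long t0;   // user event sent from the View to the Model
        long t1;   // first resulting DOM-change message arrives back at the View
        long t2;   // last resulting message arrives; GVT-tree processing starts
        long t3;   // rendering starts
        long t4;   // rendering ends

        long firstReturnMs()  { return t1 - t0; }   // first return minus send time
        long lastReturnMs()   { return t2 - t0; }   // last return minus send time
        long endRenderingMs() { return t4 - t0; }   // end of rendering minus send time
    }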
Performance Analysis I
Average Performance of Mouse Events
Each cell gives the mean ± error with the standard deviation in parentheses; all times are in milliseconds. The Mousedown column covers mousedown events only; the remaining three columns are averages over all mouse events (mousedown, mousemove, and mouseup).

Test | Distance | NB location | Mousedown: First return - Send time T1-T0 | First return - Send time T1-T0 | Last return - Send time T'1-T0 | End Rendering T4-T0
1 | Switch connects | Desktop server | 33.6 ± 3.0 (14.8) | 37.9 ± 2.1 (18.7) | 48.9 ± 2.7 (23.7) | 294.0 ± 20.0 (173.0)
2 | Switch connects | High-end Desktop server | 18.0 ± 0.57 (2.8) | 18.9 ± 0.89 (9.07) | 31.0 ± 1.7 (17.6) | 123.0 ± 8.9 (91.2)
3 | Office area | Linux server | 14.9 ± 0.65 (2.8) | 21.0 ± 1.3 (10.2) | 43.9 ± 2.6 (20.5) | 414.0 ± 24.0 (185.0)
4 | Within-City (Campus area) | Linux cluster node server | 20.0 ± 1.1 (4.8) | 29.7 ± 1.5 (13.6) | 49.5 ± 3.0 (26.3) | 334.0 ± 22.0 (194.0)
5 | Inter-City | Solaris server | 17.0 ± 0.91 (4.3) | 24.8 ± 1.6 (12.8) | 48.4 ± 3.0 (23.3) | 404.0 ± 20.0 (160.0)
6 | Inter-City | Solaris server | 20.0 ± 1.3 (6.4) | 29.6 ± 1.7 (15.3) | 50.5 ± 3.4 (26.0) | 337.0 ± 22.0 (189.0)
Performance Analysis II
Immediate bouncing back event
Each cell gives the mean ± error with the standard deviation in parentheses; all times are in milliseconds. The last three columns are averages over all mouse events (mousedown, mousemove, and mouseup).

Test | Distance | NB location | Bounce back - Send time | First return - Send time T1-T0 | Last return - Send time T'1-T0 | End Rendering T4-T0
1 | Switch connects | Desktop server | 36.8 ± 2.7 (19.0) | 52.1 ± 2.8 (19.4) | 68.0 ± 3.7 (25.9) | 405.0 ± 23.0 (159.0)
2 | Switch connects | High-end Desktop server | 20.6 ± 1.3 (12.3) | 29.5 ± 1.5 (13.8) | 49.5 ± 3.1 (29.4) | 158.0 ± 12.0 (109.0)
3 | Office area | Linux server | 24.3 ± 1.5 (11.0) | 36.3 ± 1.9 (14.2) | 54.2 ± 2.9 (21.9) | 364.0 ± 22.0 (166.0)
4 | Within-City (Campus area) | Linux cluster node server | 15.4 ± 1.1 (7.6) | 26.9 ± 1.6 (11.6) | 46.7 ± 2.9 (20.6) | 329.0 ± 25.0 (179.0)
5 | Inter-City | Solaris server | 18.1 ± 1.3 (8.8) | 31.8 ± 2.2 (14.5) | 54.6 ± 4.9 (32.8) | 351.0 ± 27.0 (179.0)
6 | Inter-City | Solaris server | 21.7 ± 1.4 (9.8) | 37.8 ± 2.7 (19.3) | 55.6 ± 3.4 (23.6) | 364.0 ± 25.0 (176.0)
Performance Analysis III
Basic NB performance in 2 hops and 4 hops
Each cell gives the mean ± error with the standard deviation in parentheses, in milliseconds.

Test | 2 hops (View - Broker - View) | 4 hops (View - Broker - Model - Broker - View)
1 | 7.65 ± 0.61 (3.78) | 13.4 ± 0.98 (6.07)
2 | 4.46 ± 0.41 (2.53) | 11.4 ± 0.66 (4.09)
3 | 9.16 ± 0.60 (3.69) | 16.9 ± 0.79 (4.85)
4 | 7.89 ± 0.61 (3.76) | 14.1 ± 1.1 (6.95)
5 | 7.96 ± 0.60 (3.68) | 14.0 ± 0.74 (4.54)
6 | 7.96 ± 0.60 (3.67) | 16.8 ± 0.72 (4.47)
Comparison of performance results to highlight the importance of the client
Figure: histograms of M-MVC Batik browser message transit time (minimum T1-T0 in milliseconds, counted in 5 ms bins), split into all events, mousedown, mouseup, and mousemove. Panels: NB on Model; Model and View on two 1.5 GHz desktop PCs; local switch network connection. NB on View; Model and View on two desktop PCs with "high-end" graphics, a Dell (3 GHz Pentium) for the View and a 1.5 GHz Dell for the Model; local switch network connection. Configuration: NB version 0.97; TCP blocking protocol; normal thread priority for NB; JMS interface; no echo of messages from the Model.
Comparison of performance results with local and remote NB locations
Figure: histograms of M-MVC Batik browser message transit time (minimum T1-T0 in milliseconds, counted in 5 ms bins), split into all events, mousedown, mouseup, and mousemove. Panels: NB on a local 2-processor Linux server; Model and View on two 1.5 GHz desktop PCs; local switch network connection. NB on an 8-processor Solaris server (ripvanwinkle); Model and View on two 1.5 GHz desktop PCs; remote network connection through routers. Configuration: NB version 0.97; TCP blocking protocol; normal thread priority for NB; JMS interface; no echo of messages from the Model.
Observations
This client-to-server-and-back transit time is only about 20% of the total
processing time in the local examples (see the worked example at the end of this list).
The overhead of the Web service decomposition is not directly measured
in the tests shown in these tables.
The changes in T1-T0 in each row reflect the different network transit
times as we move the server from local to organization locations.
The overhead of NaradaBrokering itself is 5-15 milliseconds in simple
stand-alone measurements, depending on the operating mode of the Broker.
It consists of forming message objects, serialization, and network transit
time over four hops (client to broker, broker to server, server to broker,
broker to client).
The contribution of NaradaBrokering to T1-T0 is about 30 milliseconds in
preliminary measurements, due to extra thread scheduling inside the
operating system and to interfacing with the complex SVG application.
We expect the main impacts to be the algorithmic effect of breaking the
code into two, the network and broker overhead, and thread scheduling by
the OS
We expect our architecture will work dramatically better on multi-core
chips
Further, the Java runtime has poor thread performance and can be made
much faster
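As a rough check using the local tests in the first table above: in test 2 the first-return time T1-T0 is about 18.9 ms against an end-of-rendering time T4-T0 of about 123 ms (roughly 15%), and in test 1 about 37.9 ms against 294 ms (roughly 13%), so the client-to-server round trip is indeed a modest fraction of the total client-side processing.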
Summary of Thesis Research
Proposing an “explicit Message-based MVC” paradigm (M-MVC) as
the general architecture of Web applications
Demonstrating an approach to building “collaboration as a Web
service” through monolithic SVG experiments.
Bridging the gap between desktop and Web applications by giving an
existing desktop application a Web service interface through
“M-MVC in a publish/subscribe scheme”.
As an experiment, we converted a desktop application into a distributed
system by changing the architecture from method-based MVC to
message-based MVC.
Proposing Multiple Model Multiple View (MMMV) and Single Model
Multiple View (SMMV) collaboration as the general architecture of
“collaboration as a Web service” model.
Identifying some of the key factors that influence the performance of
message-based Web applications, especially those with rich Web
content, high client interactivity, and complex rendering.
High Performance Multi-core Runtime
Multi-core architectures are expected to be the
future of “Moore’s Law”, with single-chip
performance coming from parallelism across
multiple cores rather than from increased clock
speed and sequential architecture improvements
This implies parallelism should be used in all
applications and not just the familiar scientific and
engineering areas
The runtime could be message passing for all
cases. It is interesting to compare, and try to unify,
the runtimes for MPI (classic scientific technology),
Objects, and Services, which are all message
based
We have finished an analysis of Concurrency and
Coordination Runtime (CCR) and DSS Service
Runtime
Research Question: What is the “core” multicore runtime and what is its performance?
Many parallel and/or distributed programming models are supported by a runtime
consisting of long-running or dynamic threads exchanging messages
Those coming from distributed computing often have overheads of a millisecond or
more when ported to multicore (see the M-MVC thesis results earlier)
Need microsecond-level performance on all models, like the best MPI
Examination of Microsoft CCR suggests this will be possible
Current CCR: spawning threads in MPI mode has a 2-4 microsecond overhead
Two-way service style messages take around 30 microseconds
What are the messaging primitives (beyond MPI), and what is their performance?
Messaging Model | Software | Typical Applications
Streamed dataflow; SOA | CCA, CCR, DSS, Apache Synapse, Grid Workflow | Dataflow as in AVS, Image Processing; Grids; Web Services
Spawned Tree Search | CCR | Optimization; Computer Chess
Queued Discrete Event simulations | openRTI, CERTI | Ordered Transactions; "war game" style simulations
Rendezvous Message Parallelism (MPI) | openMPI, MPICH2 | Loosely Synchronous applications including engineering & science; rendering
Publish-Subscribe Enterprise Service Bus | NaradaBrokering, Mule, JMS | Content Delivery; Message Oriented Middleware
Overlay Networks Peer-to-Peer | Jabber, JXTA, Pastry | Skype; Instant Messengers
Intel Fall 2005 Multicore Roadmap
March 2006 Sun T1000 8 core Server and December
2006 Dell Intel-based 2 Processor, each with 4 Cores
Summary of CCR and DSS Project
CCR is a message based run time supporting interacting
concurrent threads with high efficiency
Replaces CLR Thread Pool with Iteration
DSS is a Service (not a Web Service) environment designed
for Robotics (which has many control and analysis modules
implemented as services and linked by workflow)
DSS is built on CCR and released by Microsoft
We used a 2-processor 2-core AMD Opteron and a 2-processor 2-core Intel Xeon and looked at CCR and DSS
performance
For CCR we chose message patterns similar to those used in MPI
For DSS we chose simple one-way and two-way message exchange
between 2 services (an analogous timing sketch follows below)
This is the first step in examining the possibility of linking science
and more general runtimes and seeing if we can get very high
performance in all cases
We see, for example, about 50 times better performance than the
Java runtime used in the thesis
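As an analogy only, plain Java threads and blocking queues standing in for the two services (this is not the DSS or CCR API), a two-way exchange can be timed like this:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Two-way message exchange between two "services" modeled as threads.
    public class RoundTrip {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<long[]> request = new ArrayBlockingQueue<>(1);
            BlockingQueue<long[]> response = new ArrayBlockingQueue<>(1);

            Thread responder = new Thread(() -> {
                try {
                    for (int i = 0; i < 100_000; i++) {
                        response.put(request.take());   // echo each request back
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            responder.start();

            long start = System.nanoTime();
            for (int i = 0; i < 100_000; i++) {
                request.put(new long[] { i });          // request message
                response.take();                        // wait for the reply
            }
            double microsPerRoundTrip = (System.nanoTime() - start) / 1e3 / 100_000;
            System.out.printf("two-way exchange: %.1f microseconds per round trip%n",
                    microsPerRoundTrip);
            responder.join();
        }
    }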
Implementing CCR Performance Measurements
CCR is written in C# and we built a suite of test programs in this
language
Multi-threaded performance analysis tools
On the AMD machine, there is the free CodeAnalyst Performance Analyzer
It allows one to see how work is assigned to threads, but it cannot look at the microsecond
resolution needed for this work
Intel's thread analyzer (VTune) does not currently support C# or Java
Microsoft Visual Studio 2005 Team Suite Performance Analyzer (no support for
WOW64 or x64 yet)
We looked at several thread message exchange patterns similar to basic
Exchange and Shift in MPI
We took a basic computation whose smallest unit took about 1.4 (AMD) or
1.5 (Intel) microseconds
We typically ran 10^7 such units on each core to take 14 or 15 seconds
We divided this run into from 1 to 10^7 stages; at the end of each stage the
threads sent messages (in various patterns) to the threads that
continued the computation
We measured total execution time as a function of the number of stages,
with the 1-stage case having no overhead (a sketch of this setup follows below)
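A minimal sketch of this methodology, with Java threads and queues standing in for CCR dispatcher threads and ports; the structure and constants are illustrative assumptions, not the C# test suite:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Fixed total computation, split into "stages"; after each stage neighbouring
    // worker threads hand a message to the next worker (pipeline pattern).
    public class StageOverhead {
        static final int CORES = 4;
        static final long UNITS_PER_CORE = 10_000_000L;  // each "unit" ~ a microsecond of work

        @SuppressWarnings("unchecked")
        static double run(int stages) throws InterruptedException {
            BlockingQueue<Integer>[] port = new LinkedBlockingQueue[CORES];
            for (int i = 0; i < CORES; i++) port[i] = new LinkedBlockingQueue<>();

            long unitsPerStage = UNITS_PER_CORE / stages;
            Thread[] workers = new Thread[CORES];
            for (int c = 0; c < CORES; c++) {
                final int core = c;
                workers[c] = new Thread(() -> {
                    try {
                        double x = core;
                        for (int s = 0; s < stages; s++) {
                            for (long u = 0; u < unitsPerStage; u++) x = Math.sqrt(x + u); // compute unit
                            if (core < CORES - 1) port[core + 1].put(s);  // send to next worker's port
                            if (core > 0) port[core].take();              // receive from previous worker
                        }
                    } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                });
            }
            long t = System.nanoTime();
            for (Thread w : workers) w.start();
            for (Thread w : workers) w.join();
            return (System.nanoTime() - t) / 1e9;        // total run time in seconds
        }

        public static void main(String[] args) throws InterruptedException {
            double base = run(1);                        // 1 stage: no messaging overhead
            int stages = 100_000;
            double timed = run(stages);
            System.out.printf("overhead per stage ~ %.2f microseconds%n",
                    (timed - base) / stages * 1e6);
        }
    }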
Typical Thread Analysis Data View
One Stage
Figure: one stage of the CCR pipeline test. Each of four threads (Thread0 to Thread3) takes a message from its port (Port 0 to Port 3), computes, and posts a message to the corresponding port of the next stage.
Pipeline, which is the simplest loosely synchronous execution in CCR.
Note CCR supports a thread spawning model.
MPI usually uses fixed threads with message rendezvous.
Figure: four threads (Thread0 to Thread3) each post messages that are collected at an EndPort. Idealized loosely synchronous endpoint (broadcast) in CCR; an example of an MPI Collective in CCR.
Figure: four threads (Thread0 to Thread3) write exchanged messages to ports (Port 0 to Port 3) and then read messages back. Exchanging messages with a 1D Torus Exchange topology for loosely synchronous execution in CCR.
Figure: the four communication patterns used in the CCR tests, each over four threads and four ports: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive.
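A sketch of the difference between the single-item and multiple-item receive styles, using a Java queue as a stand-in for a CCR port (an analogy only, not the CCR API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Stand-in for a CCR port; take(n) plays the role of a multiple-item receive.
    class Port<T> {
        private final BlockingQueue<T> q = new LinkedBlockingQueue<>();
        void post(T item) { q.add(item); }
        List<T> take(int n) throws InterruptedException {
            List<T> items = new ArrayList<>();
            for (int i = 0; i < n; i++) items.add(q.take());  // block until n items have arrived
            return items;
        }
    }

    class Patterns {
        // (a) Pipeline / (b) Shift: each thread posts one message and receives one message.
        static void shiftStep(int me, int threads, Port<Integer>[] ports) throws InterruptedException {
            ports[(me + 1) % threads].post(me);   // send to the right neighbour
            ports[me].take(1);                    // single-item receive
        }
        // (d) Exchange: each thread posts to every other thread, then waits in one
        // multiple-item receive for the messages from all of its peers.
        static void exchangeStep(int me, int threads, Port<Integer>[] ports) throws InterruptedException {
            for (int other = 0; other < threads; other++)
                if (other != me) ports[other].post(me);
            ports[me].take(threads - 1);
        }
    }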
Average Run Time vs. Maxstage (CCR Test Results)
Figure: run time in seconds against Maxstage (millions of stages, up to about 10 million) for the 4-way Pipeline pattern with 4 dispatcher threads on the HP Opteron. The gap above the flat computation component (the run with no overhead) gives an overhead of 8.04 microseconds per stage, averaged from 1 to 10 million stages.
Fixed amount of computation (4 x 10^7 units) divided across 4 cores and from 1 to 10^7 stages on the HP Opteron multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode.
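As a rough consistency check on these numbers: the fixed computation of 4 x 10^7 units at about 1.4 microseconds each, spread over 4 cores, is roughly 14 seconds of work, while at 10^7 stages the 8.04 microseconds of overhead per stage adds roughly 80 seconds, which matches the scale of the measured curve.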
Figure: run time in seconds against Maxstage (millions of stages) for the 4-way Pipeline pattern with 4 dispatcher threads on the Dell Xeon. The gap above the computation component gives an overhead of 12.40 microseconds per stage, averaged from 1 to 10 million stages.
Fixed amount of computation (4 x 10^7 units) divided across 4 cores and from 1 to 10^7 stages on the Dell Xeon multicore. Each stage is separated by reading and writing CCR ports in Pipeline mode.
Summary of Stage Overheads for AMD Machine
These are stage switching overheads for a set of
runs with different levels of parallelism and
different message patterns; each stage takes
about 28 microseconds (500,000 stages)

Stage Overhead (microseconds), by Number of Parallel Computations:

Pattern            Mode      1      2      3      4      8
Straight Pipeline  match     0.77   2.4    3.6    5.0    8.9
Straight Pipeline  default   3.6    4.7    4.4    4.5    8.9
Shift              match     N/A    3.3    3.4    4.7    11.0
Shift              default   N/A    5.1    4.2    4.5    8.6
Two Shifts         match     N/A    4.8    7.7    9.5    26.0
Two Shifts         default   N/A    8.3    9.0    9.7    24.0
Exchange           match     N/A    11.0   15.8   18.3   Error
Exchange           default   N/A    16.8   18.2   18.6   Error
Summary of Stage Overheads for Intel Machine
These are stage switching overheads for a set of runs with
different levels of parallelism and different message patterns;
each stage takes about 30 microseconds. AMD overheads are shown
in parentheses
These measurements are equivalent to MPI latencies

Stage Overhead (microseconds), by Number of Parallel Computations; Intel values with (AMD) in parentheses:

Pattern            Mode      1            2             3             4             8
Straight Pipeline  match     1.7 (0.77)   3.3 (2.4)     4.0 (3.6)     9.1 (5.0)     25.9 (8.9)
Straight Pipeline  default   6.9 (3.6)    9.5 (4.7)     7.0 (4.4)     9.1 (4.5)     16.9 (8.9)
Shift              match     N/A          3.4 (3.3)     5.1 (3.4)     9.4 (4.7)     25.0 (11.0)
Shift              default   N/A          9.8 (5.1)     8.9 (4.2)     9.4 (4.5)     11.2 (8.6)
Two Shifts         match     N/A          6.8 (4.8)     13.8 (7.7)    13.4 (9.5)    52.7 (26.0)
Two Shifts         default   N/A          23.1 (8.3)    24.9 (9.0)    13.4 (9.7)    31.5 (24.0)
Exchange           match     N/A          28.0 (11.0)   32.7 (15.8)   41.0 (18.3)   Error
Exchange           default   N/A          34.6 (16.8)   36.1 (18.2)   41.0 (18.6)   Error
AMD Bandwidth Measurements
Previously we measured latency, since those measurements used small
messages. We did a further set of measurements of bandwidth by
exchanging larger messages of different sizes between threads
We used three types of data structures for receiving data (sketched after the table below):
– Array in thread equal to message size
– Array outside thread equal to message size
– Data stored sequentially in a large array (“stepped” array)
For AMD and Intel, total bandwidth is 1 to 2 Gigabytes/second
Bandwidths in Gigabytes/second summed over 4 cores

Number of stages | Array Inside Thread (Small, Large) | Array Outside Threads (Small, Large) | Stepped Array Outside Thread (Small, Large) | Approx. Compute Time per stage (µs)
250000 | 0.90, 0.96 | 1.08, 1.09 | 1.14, 1.10 | 56.0
2500 | 0.89, 0.99 | 1.16, 1.11 | 1.14, 1.13 | 56.0
5000 | 1.19, 1.15 | 1.15, 1.13 | 1.13 up to 10^7 words | 2800
200000 | | | 1.13 up to 10^7 words | 70
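A sketch of the three receiving data structures, with hypothetical names (not the C# test code); the point is only where the received payload is written:

    // Three alternative places a stage can copy an incoming payload (one is used per run).
    class ReceiveBuffers {
        double[] insideThread;    // array belonging to the receiving thread, length == message size
        double[] outsideThreads;  // array allocated outside the threads, length == message size
        double[] stepped;         // one large array; successive stages write at successive offsets
        int steppedOffset;

        ReceiveBuffers(int messageSize, int totalSize) {
            insideThread = new double[messageSize];
            outsideThreads = new double[messageSize];
            stepped = new double[totalSize];
        }

        void intoInsideThread(double[] payload)   { System.arraycopy(payload, 0, insideThread, 0, payload.length); }
        void intoOutsideThreads(double[] payload) { System.arraycopy(payload, 0, outsideThreads, 0, payload.length); }
        void intoStepped(double[] payload) {
            System.arraycopy(payload, 0, stepped, steppedOffset, payload.length);
            steppedOffset += payload.length;      // "stepped" locations advance through the big array
        }
    }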
Intel Bandwidth Measurements
For bandwidth, the Intel did better than the AMD, especially when one
exploited the on-chip cache with small transfers
For both AMD and Intel, each stage executed a computational task
after copying data arrays of size 10^5 (labeled small), 10^6 (labeled
large) or 10^7 double words. The last column is an approximate value
in microseconds of the compute time for each stage. Note that
copying 100,000 double precision words (0.8 MB) per core, at a
per-core share of roughly 0.25 Gigabytes/second of the total
gigabyte/second bandwidth, takes about 3200 µs. The data to be copied
(the message payload in CCR) is fixed and its creation time is outside the
timed process
Bandwidths in Gigabytes/second summed over 4 cores

Number of stages | Array Inside Thread (Small, Large) | Array Outside Threads (Small, Large) | Stepped Array Outside Thread (Small, Large) | Approx. Compute Time per stage (µs)
250000 | 0.84, 0.75 | 1.92, 0.90 | 1.18, 0.90 | 59.5
200000 | 1.21, 0.91 | 1.75, 1.0 | | 74.4
5000 | 0.83, 0.76 | 1.89, 0.89 | 1.16, 0.89 | 2970
2500 | 1.74, 0.9 | 2.0, 1.07 | 1.78, 1.06 | 5950
Figure: run time in seconds against the size of the copied array (millions of double words) for the 4-way Pipeline pattern with 4 dispatcher threads on the Dell Xeon, with the time including the array copy. A slope change (cache effect) is visible: total bandwidth is about 1.75 Gigabytes/sec up to 100,000 double words and about 1.0 Gigabytes/sec up to one million double words.
Typical bandwidth measurement showing the effect of cache through the slope change. Run time is plotted against the size of the double array copied in each of 5,000 stages from a thread to stepped locations in a large array on the Dell Xeon multicore.
DSS Service Measurements
Figure: average run time in microseconds against the number of round trips (1 to 10,000). Timing of the HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release).
CGL measurements of Axis 2 show about 500 microseconds; DSS is about 10 times better
References
Thesis for download
http://grids.ucs.indiana.edu/~xqiu/dissertation.html
Thesis project
http://grids.ucs.indiana.edu/~xqiu/research.html
Publications and Presentations
http://grids.ucs.indiana.edu/~xqiu/publication.html
NaradaBrokering Open Source Messaging System
http://www.naradabrokering.org
Information about Community Grids Lab project and publications
http://grids.ucs.indiana.edu/ptliupages/
Xiaohong Qiu, Geoffrey Fox, Alex Ho, Analysis of Concurrency and
Coordination Runtime CCR and DSS for Parallel and Distributed Computing,
technical report, November 2006
Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press,
April 2006