Message-based MVC and High Performance Multi-core Runtime
Xiaohong Qiu
xqiu@indiana.edu
December 21, 2006

Session Outline
- My Brief Background: education and work experience
- Ph.D. Thesis Research: Message-based MVC Architecture for Distributed and Desktop Applications
- Recent Research Project: High Performance Multi-core Runtime

My Brief Background I
- 1987-1991: Computer Science program at Beihang University
  - CS was viewed as a promising field to get into at the time
  - Four years of foundation courses, computer hardware and software courses, labs, projects, and an internship
  - Programming languages used included assembly language, Basic, Pascal, Fortran 77, Prolog, Lisp, and C; programming environments included DOS, Unix, Windows, and Macintosh
- 1995-1998: Computer Science graduate program at Beihang University
  - Graduate research assistant at the National Lab of Software Development Environment
  - Participated in the team project SNOW (shared memory network of workstations), working on an improved parallel I/O subsystem algorithm based on the two-phase method and MPI I/O
- 1991-1998: Faculty at Beihang University
  - Assistant Lecturer and Lecturer, teaching Database and Introduction to Computing courses

My Brief Background II
- 1998-2000: M.S., Computer Information Science program at Syracuse University
- 2000-2005: Ph.D., Computer Information Science program at Syracuse University
  - The thesis project involved surveying, designing, and evaluating a new paradigm for the next generation of rich-media software applications, one that unifies legacy desktop and Internet applications with automatic collaboration and universal access capabilities
  - Attended conferences to present research papers and exhibit projects
  - Awarded a Syracuse University Fellowship from 1998 to 2001 and named Outstanding Graduate Student of the College of Electrical Engineering and Computer Science in 2005
- May 2005-present: Visiting Researcher at the Community Grids Lab, Indiana University
- June-November 2006: Software Project Lead at Anabas Inc.
  - Analysis of the Concurrency and Coordination Runtime (CCR) and Decentralized Software Services (DSS) for parallel and distributed computing

Message-based MVC (M-MVC)
- Research Background
- Architecture of Message-based MVC
- Collaboration Paradigms
- SVG Experiments
- Performance Analysis
- Summary of Thesis Research

Research Background
Motivations
- CPU speed (Moore's Law) and network bandwidth (Gilder's Law) continue to improve, bringing fundamental changes
- Internet and Web technologies have evolved into a global information infrastructure for sharing resources
- Applications are getting increasingly sophisticated: Internet collaboration enabling virtual enterprises, and large-scale distributed computing
- This requires a new application architecture that is adaptable to fast technology change, with properties such as simplicity, reusability, scalability, reliability, and performance
The general area is technology support for synchronous and asynchronous resource sharing:
- e-learning (e.g., video/audio conferencing)
- e-science (e.g., large-scale distributed computing)
- e-business (e.g., virtual organizations)
- e-entertainment (e.g., online games)
Research on a generic model for building applications
- Application domains:
  - Distributed (Web): Service Oriented Architecture and Web Services
  - Desktop (client): Model-View-Controller (MVC) paradigm
  - Internet collaboration: hierarchical Web Service pipeline model

Architecture of Message-based MVC
A comparison of MVC, the Web Service pipeline, and Message-based MVC.
[Diagram: (a) the MVC model and (b) a three-stage pipeline, shown as a decomposition of the SVG browser into a semantic Model (high-level UI and event processing, exposed as a Web Service) and a View (raw UI, rendering, and display), connected through input and output ports; user events flow as messages from View to Model and rendering updates flow back as messages from Model to View, with the messages carrying the control information.]

Features of the Message-based MVC Paradigm
- M-MVC is a general approach to building applications with a message-based paradigm
- It emphasizes a universal, modularized service model with messaging linkage
- It converges desktop applications, Web applications, and Internet collaboration
- MVC and Web Services are the fundamental architectures for desktop and Web applications respectively
- The Web Service pipeline model provides the general collaboration architecture for distributed applications
- M-MVC is a uniform architecture integrating the above models
- M-MVC allows automatic collaboration, which simplifies the architecture design

Collaboration Paradigms I
SMMV vs. MMMV as MVC interaction patterns
[Diagram: (a) Single Model Multiple View, one Model shared by Views 1 through n; (b) Multiple Model Multiple View, Models 1 through m each paired with its own View.]
- Flynn's taxonomy classifies parallel computing platforms into four types: SISD, MISD, SIMD, and MIMD
  - SIMD: a single control unit dispatches instructions to each processing unit
  - MIMD: each processor can execute a different program independently of the other processors, enabling asynchronous processing
- SMMV generalizes the concept of SIMD
- MMMV generalizes the concept of MIMD
- In practice, the SMMV and MMMV patterns can be applied in both asynchronous and synchronous applications, and thus form general collaboration paradigms

Collaboration Paradigms II
- Monolithic collaboration: CGL applications for PowerPoint, OpenOffice, and data visualization
  [Diagram: identical SVG browser programs (a master client and several other clients) receiving identical events through NaradaBrokering.]
- Collaboration paradigms deployed with the M-MVC model: SMMV (e.g., instructor-led learning) and MMMV (e.g., participatory learning)
  [Diagram: the Model exposed as a Web Service behind NaradaBrokering brokers, serving multiple Views at the master and other clients, in both the SMMV and MMMV arrangements.]

SVG Experiments I
- Monolithic SVG experiments: a collaborative SVG browser and a collaborative SVG chess game with players and observers

SVG Experiments II
- Decomposed the SVG browser into the stages of a pipeline
[Diagram: the SVG browser decomposed across three machines; the View (client) on Machine A with input (UI events), a mirrored DOM tree, the GVT tree, and output (rendering); the NaradaBrokering notification service (broker) on Machine B; and the Model (service) on Machine C with event processors, JavaScript, and the DOM tree before and after mutation; times T0 through T4 are marked along the path.]
- T0: a given user event, such as a mouse click, is sent from the View to the Model
- T1: a given user event can generate multiple associated DOM change events transmitted from the Model to the View; T1 is the arrival time at the View of the first of these
- T2: the arrival of the last of these events from the Model and the start of the processing of the set of events in the GVT tree
- T3: the start of the rendering stage
- T4: the end of the rendering stage
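In the decomposed browser the View and the Model exchange these events through NaradaBrokering's JMS interface. The sketch below is a minimal, hypothetical illustration (not the thesis code) of the "events as messages" idea using the standard javax.jms topic API; the topic name, payload fields, and JNDI configuration are assumptions made for the example.

```java
import javax.jms.*;
import javax.naming.InitialContext;

// Minimal JMS sketch of "events as messages" in M-MVC (illustrative, not the thesis code).
// The View publishes each UI event to a topic; the Model subscribes, applies the event to
// its DOM tree, and would publish rendering updates back on a second topic (omitted here).
public class MMVCEventSketch {
    public static void main(String[] args) throws Exception {
        InitialContext jndi = new InitialContext();            // provider configuration assumed
        TopicConnectionFactory factory =
                (TopicConnectionFactory) jndi.lookup("TopicConnectionFactory");
        TopicConnection connection = factory.createTopicConnection();
        TopicSession modelSession =
                connection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic uiEvents = modelSession.createTopic("svg/ui-events");   // hypothetical topic name

        // Model side: subscribe to UI events arriving as messages.
        TopicSubscriber subscriber = modelSession.createSubscriber(uiEvents);
        subscriber.setMessageListener(message -> {
            try {
                MapMessage event = (MapMessage) message;
                System.out.printf("Model received %s at (%d,%d)%n",
                        event.getString("type"), event.getInt("x"), event.getInt("y"));
                // ... mutate the DOM tree, then publish DOM/rendering updates to the View ...
            } catch (JMSException e) {
                e.printStackTrace();
            }
        });
        connection.start();

        // View side (separate session): publish a mouse event as a message at time T0.
        TopicSession viewSession =
                connection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
        TopicPublisher publisher = viewSession.createPublisher(uiEvents);
        MapMessage click = viewSession.createMapMessage();
        click.setString("type", "mousedown");
        click.setInt("x", 120);
        click.setInt("y", 45);
        click.setLong("sendTime", System.currentTimeMillis());
        publisher.publish(click);

        Thread.sleep(1000);
        connection.close();
    }
}
```

A second topic carrying the Model's DOM change and rendering messages back to the View would complete the round trip whose timing points T0 through T4 are measured below.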
Performance Analysis I
Average performance of mouse events, in milliseconds, given as mean ± error with the standard deviation in parentheses. The first data column is for mousedown events; the remaining columns average all mouse events (mousedown, mousemove, and mouseup).

Test  Distance                   NB location                Mousedown first return T1-T0   First return T1-T0    Last return T'1-T0    End rendering T4-T0
1     Switch connects            Desktop server             33.6 ± 3.0 (14.8)              37.9 ± 2.1 (18.7)     48.9 ± 2.7 (23.7)     294.0 ± 20.0 (173.0)
2     Switch connects            High-end desktop server    18.0 ± 0.57 (2.8)              18.9 ± 0.89 (9.07)    31.0 ± 1.7 (17.6)     123.0 ± 8.9 (91.2)
3     Office area                Linux server               14.9 ± 0.65 (2.8)              21.0 ± 1.3 (10.2)     43.9 ± 2.6 (20.5)     414.0 ± 24.0 (185.0)
4     Within-city (campus area)  Linux cluster node server  20.0 ± 1.1 (4.8)               29.7 ± 1.5 (13.6)     49.5 ± 3.0 (26.3)     334.0 ± 22.0 (194.0)
5     Inter-city                 Solaris server             17.0 ± 0.91 (4.3)              24.8 ± 1.6 (12.8)     48.4 ± 3.0 (23.3)     404.0 ± 20.0 (160.0)
6     Inter-city                 Solaris server             20.0 ± 1.3 (6.4)               29.6 ± 1.7 (15.3)     50.5 ± 3.4 (26.0)     337.0 ± 22.0 (189.0)

Performance Analysis II
Immediate bouncing-back event, in milliseconds, given as mean ± error with the standard deviation in parentheses. The first data column is the bounce-back time; the remaining columns average all mouse events (mousedown, mousemove, and mouseup).

Test  Distance                   NB location                Bounce back - send time   First return T1-T0    Last return T'1-T0    End rendering T4-T0
1     Switch connects            Desktop server             36.8 ± 2.7 (19.0)         52.1 ± 2.8 (19.4)     68.0 ± 3.7 (25.9)     405.0 ± 23.0 (159.0)
2     Switch connects            High-end desktop server    20.6 ± 1.3 (12.3)         29.5 ± 1.5 (13.8)     49.5 ± 3.1 (29.4)     158.0 ± 12.0 (109.0)
3     Office area                Linux server               24.3 ± 1.5 (11.0)         36.3 ± 1.9 (14.2)     54.2 ± 2.9 (21.9)     364.0 ± 22.0 (166.0)
4     Within-city (campus area)  Linux cluster node server  15.4 ± 1.1 (7.6)          26.9 ± 1.6 (11.6)     46.7 ± 2.9 (20.6)     329.0 ± 25.0 (179.0)
5     Inter-city                 Solaris server             18.1 ± 1.3 (8.8)          31.8 ± 2.2 (14.5)     54.6 ± 4.9 (32.8)     351.0 ± 27.0 (179.0)
6     Inter-city                 Solaris server             21.7 ± 1.4 (9.8)          37.8 ± 2.7 (19.3)     55.6 ± 3.4 (23.6)     364.0 ± 25.0 (176.0)

Performance Analysis III
Basic NaradaBrokering performance for 2 hops (View - Broker - View) and 4 hops (View - Broker - Model - Broker - View), in milliseconds, given as mean ± error with the standard deviation in parentheses.

Test  2 hops               4 hops
1     7.65 ± 0.61 (3.78)   13.4 ± 0.98 (6.07)
2     4.46 ± 0.41 (2.53)   11.4 ± 0.66 (4.09)
3     9.16 ± 0.60 (3.69)   16.9 ± 0.79 (4.85)
4     7.89 ± 0.61 (3.76)   14.1 ± 1.1 (6.95)
5     7.96 ± 0.60 (3.68)   14.0 ± 0.74 (4.54)
6     7.96 ± 0.60 (3.67)   16.8 ± 0.72 (4.47)
Comparison of performance results to highlight the importance of the client
[Histograms: message transit time (minimum T1-T0 in milliseconds, events per 5 ms bin) in the M-MVC Batik browser, for all events and separately for mousedown, mouseup, and mousemove. One panel: NB on the Model; Model and View on two 1.5 GHz desktop PCs; local switch network connection. The other panel: NB on the View; Model and View on two desktop PCs with "high-end" graphics, a 3 GHz Pentium Dell for the View and a 1.5 GHz Dell for the Model; local switch network connection. Common configuration: NB version 0.97; TCP blocking protocol; normal thread priority for NB; JMS interface; no echo of messages from the Model.]

Comparison of performance results with local and remote NB locations
[Histograms: message transit time (minimum T1-T0 in milliseconds, events per 5 ms bin) in the M-MVC Batik browser, for all events and separately for mousedown, mouseup, and mousemove. One panel: NB on a local 2-processor Linux machine (ripvanwinkle); Model and View on two 1.5 GHz desktop PCs; local switch network connection. The other panel: NB on an 8-processor Solaris server; Model and View on two 1.5 GHz desktop PCs; remote network connection through routers.]

Observations
- The client-to-server-and-back transit time is only about 20% of the total processing time in the local examples.
- The overhead of the Web Service decomposition is not directly measured in the tests shown in these tables.
- The changes in T1-T0 from row to row reflect the different network transit times as the server moves from local to remote locations.
- The overhead of NaradaBrokering itself is 5-15 milliseconds in simple stand-alone measurements, depending on the operating mode of the broker. It consists of forming message objects, serialization, and network transit time over four hops (client to broker, broker to server, server to broker, broker to client).
- The contribution of NaradaBrokering to T1-T0 is about 30 milliseconds in preliminary measurements, due to the extra thread scheduling inside the operating system and the interfacing with the complex SVG application.
- We expect the main impacts to be the algorithmic effect of breaking the code in two, the network and broker overhead, and thread scheduling by the OS.
- We expect our architecture to work dramatically better on multi-core chips. Further, the Java runtime has poor thread performance and can be made much faster.
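Read together, these observations suggest a rough decomposition of the measured latency (the symbols below are introduced here for illustration and are not taken from the thesis):

\[
T_1 - T_0 \;\approx\; t_{\mathrm{app}} + t_{\mathrm{broker}} + t_{\mathrm{sched}}
\]

where t_app is the View and Model application processing, t_broker is the four-hop NaradaBrokering transit (about 5-15 ms stand-alone), and t_sched is the extra operating-system thread scheduling; in these runs the broker-related terms together contribute roughly 30 ms.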
Summary of Thesis Research
- Proposing an "explicit Message-based MVC" paradigm (M-MVC) as a general architecture for Web applications.
- Demonstrating an approach to building "collaboration as a Web service" through the monolithic SVG experiments.
- Bridging the gap between desktop and Web applications by giving an existing desktop application a Web Service interface through "M-MVC in a publish/subscribe scheme". As an experiment, we converted a desktop application into a distributed system by changing its architecture from method-based MVC to message-based MVC.
- Proposing Multiple Model Multiple View (MMMV) and Single Model Multiple View (SMMV) collaboration as the general architecture of the "collaboration as a Web service" model.
- Identifying some of the key factors that influence the performance of message-based Web applications, especially those with rich Web content, high client interactivity, and complex rendering.

High Performance Multi-core Runtime
- Multi-core architectures are expected to be the future of Moore's Law, with single-chip performance coming from parallelism across multiple cores rather than from increased clock speed and sequential architecture improvements.
- This implies parallelism should be used in all applications, not just the familiar scientific and engineering areas.
- The runtime could be message passing for all cases. It is interesting to compare, and try to unify, the runtimes for MPI (the classic scientific technology), objects, and services, which are all message based.
- We have finished an analysis of the Concurrency and Coordination Runtime (CCR) and the DSS service runtime.

Research Question: what is the "core" multicore runtime and what is its performance?
- Many parallel and/or distributed programming models are supported by a runtime consisting of long-running or dynamic threads exchanging messages.
- Those coming from distributed computing often have overheads of a millisecond or more when ported to multicore (see the M-MVC thesis results earlier).
- We need microsecond-level performance on all models, like the best MPI implementations.
- Examination of Microsoft CCR suggests this will be possible:
  - Current CCR thread spawning in MPI mode has a 2-4 microsecond overhead.
  - Two-way service-style messages take around 30 microseconds.
- What are the messaging primitives (adding to MPI), and what is their performance?

Messaging models, representative software, and typical applications:
- Streamed (streamed dataflow; SOA). Software: CCA, CCR, DSS, Apache Synapse, Grid workflow. Applications: dataflow as in AVS, image processing; Grids; Web Services.
- Spawned. Software: CCR. Applications: tree search, optimization; computer chess.
- Queued. Software: openRTI, CERTI. Applications: discrete event simulations, ordered transactions; "war game" style simulations.
- Rendezvous (message parallelism). Software: MPI, OpenMPI, MPICH2. Applications: loosely synchronous applications including engineering and science; rendering.
- Publish-subscribe (Enterprise Service Bus). Software: NaradaBrokering, Mule, JMS. Applications: content delivery; message-oriented middleware.
- Overlay networks (peer-to-peer). Software: Jabber, JXTA, Pastry. Applications: Skype; instant messengers.

[Slides: Intel Fall 2005 multicore roadmap; March 2006 Sun T1000 8-core server; December 2006 Dell Intel-based machine with 2 processors, each with 4 cores.]

Summary of CCR and DSS Project
- CCR is a message-based runtime supporting interacting concurrent threads with high efficiency; it replaces the CLR thread pool with iteration.
- DSS is a service (not a Web Service) environment designed for robotics, which has many control and analysis modules implemented as services and linked by workflow.
- DSS is built on CCR and released by Microsoft.
- We used a 2-processor, 2-core AMD Opteron and a 2-processor, 2-core Intel Xeon and looked at CCR and DSS performance.
- For CCR we chose message patterns similar to those used in MPI.
- For DSS we chose simple one-way and two-way message exchanges between two services.
- This is a first step in examining the possibility of linking the scientific and more general runtimes and seeing if we can get very high performance in all cases.
- We see, for example, about 50 times better performance than the Java runtime used in the thesis.
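The DSS timings quoted above come from simple one-way and two-way message exchanges between two services. As a rough, hypothetical analogue in plain Java (not the CCR/DSS API), a two-way round-trip measurement looks like this:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough Java analogue of timing two-way "service style" message exchanges
// (illustrative only; the actual measurements used Microsoft CCR/DSS in C#).
public class RoundTripSketch {
    public static void main(String[] args) throws InterruptedException {
        final int roundTrips = 100_000;
        BlockingQueue<long[]> requests = new ArrayBlockingQueue<>(1);
        BlockingQueue<long[]> responses = new ArrayBlockingQueue<>(1);

        // "Service" thread: echo each request back as a response.
        Thread service = new Thread(() -> {
            try {
                for (int i = 0; i < roundTrips; i++) {
                    responses.put(requests.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        service.start();

        // "Client" (main thread): send a request, wait for the response, repeat.
        long start = System.nanoTime();
        for (int i = 0; i < roundTrips; i++) {
            requests.put(new long[] { i });
            responses.take();
        }
        long elapsed = System.nanoTime() - start;
        service.join();

        System.out.printf("Average round trip: %.1f microseconds%n",
                elapsed / 1_000.0 / roundTrips);
    }
}
```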
Implementing CCR Performance Measurements
- CCR is written in C# and we built a suite of test programs in this language.
- Multi-threaded performance analysis tools:
  - On the AMD machine there is the free CodeAnalyst Performance Analyzer. It lets one see how work is assigned to threads, but it cannot provide the microsecond resolution needed for this work.
  - The Intel thread analyzer (VTune) does not currently support C# or Java.
  - The Microsoft Visual Studio 2005 Team Suite Performance Analyzer does not yet support WOW64 or x64.
- We looked at several thread message exchange patterns similar to the basic Exchange and Shift in MPI.
- We took a basic computation whose smallest unit took about 1.4 (AMD) to 1.5 (Intel) microseconds.
- We typically ran 10^7 such units on each core, taking 14 or 15 seconds.
- We divided this run into from 1 to 10^7 stages; at the end of each stage the threads sent messages (in various patterns) to the next threads, which continued the computation.
- We measured total execution time as a function of the number of stages used, with one stage having no overheads.

[Figure: typical thread analysis data view.]

[Diagrams: message flow through CCR ports for one stage of four threads (Thread0 through Thread3, each with its own port), and the patterns used in the tests: the Pipeline, the simplest loosely synchronous execution in CCR; an idealized loosely synchronous endpoint (broadcast) in CCR, an example of an MPI collective; exchanging messages with a 1D torus exchange topology; and the four communication patterns used in the CCR tests, (a) Pipeline, (b) Shift, (c) Two Shifts, and (d) Exchange, where (a) and (b) use CCR Receive and (c) and (d) use CCR Multiple Item Receive. Note that CCR supports a thread-spawning model, while MPI usually uses fixed threads with message rendezvous.]

[Plot: average run time versus Maxstage (CCR test results) for the 4-way Pipeline pattern with 4 dispatcher threads on the HP Opteron multicore. A fixed amount of computation (4 x 10^7 units) is divided over 4 cores and into from 1 to 10^7 stages, each stage separated by reading and writing CCR ports in Pipeline mode. The overhead is 8.04 microseconds per stage, averaged from 1 to 10 million stages.]

[Plot: the same measurement on the Dell Xeon multicore (4-way Pipeline pattern, 4 dispatcher threads). The overhead is 12.40 microseconds per stage, averaged from 1 to 10 million stages.]
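The test harness itself is written in C# against CCR ports. As a rough analogue only, the hypothetical Java sketch below reproduces the methodology for the Shift pattern: a fixed amount of computation is split over four threads and divided into stages, and at each stage boundary every thread sends a message to its neighbour's queue and waits on its own, so the extra run time relative to a one-stage run approximates the per-stage messaging overhead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical Java analogue of the CCR stage-overhead benchmark (the real harness is C#/CCR).
// Four threads each run a fixed amount of work divided into stages; at each stage boundary a
// thread sends a token to its neighbour (the "Shift" pattern) and waits for one on its own queue.
public class StageOverheadSketch {
    static final int THREADS = 4;
    static final long TOTAL_UNITS_PER_THREAD = 10_000_000L;

    static double computeUnit(double x) {
        for (int i = 0; i < 10; i++) x = Math.sqrt(x + i);   // stand-in for the ~1.5 us compute unit
        return x;
    }

    public static void main(String[] args) throws InterruptedException {
        int stages = Integer.parseInt(args.length > 0 ? args[0] : "1000");
        long unitsPerStage = TOTAL_UNITS_PER_THREAD / stages;

        @SuppressWarnings("unchecked")
        BlockingQueue<Integer>[] ports = new BlockingQueue[THREADS];
        for (int i = 0; i < THREADS; i++) ports[i] = new ArrayBlockingQueue<>(1);

        Thread[] workers = new Thread[THREADS];
        long start = System.nanoTime();
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                double x = id;
                try {
                    for (int s = 0; s < stages; s++) {
                        for (long u = 0; u < unitsPerStage; u++) x = computeUnit(x);
                        ports[(id + 1) % THREADS].put(s);   // send a token to the neighbour
                        ports[id].take();                   // wait for the neighbour's token
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                if (x == -1) System.out.println(x);         // keep the work from being optimized away
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        long elapsed = System.nanoTime() - start;

        // Per-stage overhead is estimated as (time with N stages - time with 1 stage) / N.
        System.out.printf("%d stages: %.2f seconds total%n", stages, elapsed / 1e9);
    }
}
```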
Summary of Stage Overheads for the AMD Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 28 microseconds (500,000 stages).

Stage overhead (microseconds) versus the number of parallel computations:

Pattern            Mode      1      2      3      4      8
Straight Pipeline  match     0.77   2.4    3.6    5.0    8.9
Straight Pipeline  default   3.6    4.7    4.4    4.5    8.9
Shift              match     N/A    3.3    3.4    4.7    11.0
Shift              default   N/A    5.1    4.2    4.5    8.6
Two Shifts         match     N/A    4.8    7.7    9.5    26.0
Two Shifts         default   N/A    8.3    9.0    9.7    24.0
Exchange           match     N/A    11.0   15.8   18.3   Error
Exchange           default   N/A    16.8   18.2   18.6   Error

Summary of Stage Overheads for the Intel Machine
These are stage-switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. The AMD overheads are shown in parentheses. These measurements are equivalent to MPI latencies.

Stage overhead (microseconds) versus the number of parallel computations:

Pattern            Mode      1            2             3             4             8
Straight Pipeline  match     1.7 (0.77)   3.3 (2.4)     4.0 (3.6)     9.1 (5.0)     25.9 (8.9)
Straight Pipeline  default   6.9 (3.6)    9.5 (4.7)     7.0 (4.4)     9.1 (4.5)     16.9 (8.9)
Shift              match     N/A          3.4 (3.3)     5.1 (3.4)     9.4 (4.7)     25.0 (11.0)
Shift              default   N/A          9.8 (5.1)     8.9 (4.2)     9.4 (4.5)     11.2 (8.6)
Two Shifts         match     N/A          6.8 (4.8)     13.8 (7.7)    13.4 (9.5)    52.7 (26.0)
Two Shifts         default   N/A          23.1 (8.3)    24.9 (9.0)    13.4 (9.7)    31.5 (24.0)
Exchange           match     N/A          28.0 (11.0)   32.7 (15.8)   41.0 (18.3)   Error
Exchange           default   N/A          34.6 (16.8)   36.1 (18.2)   41.0 (18.6)   Error

AMD Bandwidth Measurements
- The measurements above are essentially latencies, since they used small messages. We made a further set of bandwidth measurements by exchanging larger messages of different sizes between threads.
- We used three types of data structures for receiving data: an array inside the thread equal to the message size; an array outside the thread equal to the message size; and data stored sequentially in a large array (a "stepped" array) outside the thread.
- For both AMD and Intel, the total bandwidth is 1 to 2 gigabytes/second.

[Table: AMD bandwidths in gigabytes/second summed over 4 cores for the three receiving data structures (array inside thread, array outside thread, stepped array outside thread) with small and large messages (up to 10^7 double words), together with the approximate compute time per stage in microseconds, for runs of 250,000 down to 2,500 stages; the measured values lie between roughly 0.89 and 1.19 GB/s.]
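As with the latency tests, the real bandwidth measurements used C#/CCR. The hypothetical Java sketch below illustrates the "stepped array" case: each thread repeatedly copies its message payload into its own disjoint slice of one large shared array, and the aggregate bytes moved per second give the bandwidth summed over the cores.

```java
// Hypothetical Java analogue of the "stepped array" bandwidth test (the real code used C#/CCR).
// Each of four threads copies a fixed payload into its own slice of one large shared array;
// the aggregate bytes copied per second approximate the bandwidth summed over the cores.
public class BandwidthSketch {
    public static void main(String[] args) throws InterruptedException {
        final int threads = 4;
        final int payloadWords = 100_000;      // "small" message: 1e5 double words
        final int repetitions = 2_000;
        final double[] stepped = new double[threads * payloadWords];   // shared destination

        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int offset = t * payloadWords;
            workers[t] = new Thread(() -> {
                double[] payload = new double[payloadWords];   // payload allocated once per thread
                for (int r = 0; r < repetitions; r++) {
                    System.arraycopy(payload, 0, stepped, offset, payloadWords);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        long elapsed = System.nanoTime() - start;

        double gigabytes = (double) threads * repetitions * payloadWords * 8 / 1e9;
        System.out.printf("Summed bandwidth: %.2f GB/s%n", gigabytes / (elapsed / 1e9));
    }
}
```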
Intel Bandwidth Measurements
- For bandwidth, the Intel did better than the AMD, especially when one exploited the on-chip cache with small transfers.
- For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large), or 10^7 double words. The last column of the table is the approximate compute time per stage in microseconds. Note that copying 100,000 double-precision words per core at a gigabyte/second aggregate bandwidth takes 3200 µs (0.8 MB per core at roughly 0.25 GB/s per core).
- The data to be copied (the message payload in CCR) is fixed, and its creation time is outside the timed process.

[Table: Intel bandwidths in gigabytes/second summed over 4 cores for the same three receiving data structures with small and large messages, together with the approximate compute time per stage in microseconds, for runs of 250,000 down to 2,500 stages; the measured values lie between roughly 0.75 and 2.0 GB/s.]

[Plot: typical bandwidth measurement showing the cache effect as a slope change; run time with array copy plotted against the size of the double array copied in each of 5,000 stages from a thread to stepped locations in a large array, on the Dell Xeon multicore (4-way Pipeline pattern, 4 dispatcher threads). Total bandwidth is about 1.0 gigabytes/second up to one million double words and about 1.75 gigabytes/second up to 100,000 double words.]

DSS Service Measurements
[Plot: average run time in microseconds on the HP Opteron multicore as a function of the number of simultaneous two-way service messages (round trips) processed, for the November 2006 DSS release.]
CGL measurements of Axis 2 show about 500 microseconds, so DSS is roughly 10 times better.

References
- Thesis for download: http://grids.ucs.indiana.edu/~xqiu/dissertation.html
- Thesis project: http://grids.ucs.indiana.edu/~xqiu/research.html
- Publications and presentations: http://grids.ucs.indiana.edu/~xqiu/publication.html
- NaradaBrokering open source messaging system: http://www.naradabrokering.org
- Community Grids Lab projects and publications: http://grids.ucs.indiana.edu/ptliupages/
- Xiaohong Qiu, Geoffrey Fox, and Alex Ho, "Analysis of Concurrency and Coordination Runtime CCR and DSS for Parallel and Distributed Computing", technical report, November 2006.
- Shameem Akhter and Jason Roberts, "Multi-Core Programming", Intel Press, April 2006.