Parallel Applications and Multicast for Bufferless Interconnect Networks

Alfred Barnat, Benson Tsai, Javier Novales

1 Problem

1.1 Definition

Our group has chosen to work on the topic “Bufferless Interconnect Networks – Parallel Applications”. We would like to expand on the work by Moscibroda and Mutlu on the BLESS protocol for bufferless routing in on-chip networks [4]. We have identified two ways in which we might extend this protocol: implement support for broadcast and/or multicast traffic, or explore ways to prioritize among traffic generated by multiple threads within one application.

1.2 Motivation

The main limitation of existing work on BLESS is a lack of thorough testing with parallel multithreaded applications. Existing performance measurements have been limited to running fully predetermined independent traces on multiple CPU cores. Since then, Chris Fallen has added a nearly complete cache coherency implementation to the simulator used for the original testing, so that parallel traces with read-write dependencies can be tested.

Currently, the class of cache coherency algorithms that can be implemented on top of the BLESS protocol is limited by the lack of broadcast and multicast support. It is our hope that by adding support for some combination of broadcast and multicast traffic, we may be able to implement more efficient coherency protocols that improve cache latency or decrease overall memory-related bandwidth on the network.

Another aspect of multithreaded execution not covered in previous BLESS research is whether overall application performance can be improved through the intelligent prioritization of requests from critical threads. In the case of an application with one dispatch thread and many processing threads, for example, overall performance may improve if packets from the core running the dispatch thread are given a higher priority than those originating at the cores running processing threads.

2 Related Work

The work in this project will be largely based on previous work on the BLESS protocol [4]. Since the publication of the 2009 paper on the topic, the simulator used in testing BLESS has been extended to support cache coherency and can now simulate a subset of parallel applications. Specifically, the simulator is able to synchronize multithreaded application traces on read-write dependencies.

Virtual Tree Coherence (VTC) offers one potential method to improve cache latency using multicast in an on-chip network. VTC is based on the Virtual Circuit Tree Multicasting (VCTM) protocol, which utilizes dynamically generated routing trees within an on-chip network to simulate ordered multicast packets within a subset of nodes [3]. VTC offers benefits over both traditional broadcast-based techniques and directory-based techniques by allowing coherence communication to remain within the working set of nodes using a given memory location [2].

In the area of request prioritization, Bhattacharjee and Martonosi have had success in developing thread criticality predictors that detect which processing cores are currently running critical threads and prioritize requests originating at those cores [1].

3 Solution

3.1 Ideas

We would like to focus our efforts on those use cases that may benefit from the implementation of an efficient multicast and/or broadcast protocol on top of BLESS. As a first step, we could look at implementing VCTM on top of BLESS in order to enable the use of VTC for cache coherency. Further steps would stem from a more complete investigation of existing work in the area of multicast within on-chip networks, especially in relation to cache coherence, as well as from our experience as we begin testing techniques.

3.2 Experimental Methodology

We will be making use of the existing BLESS simulator and extending it to test our ideas. Aside from the protocol-specific additions, we will need to extend the simulator to support more complex parallel execution patterns. Currently, the simulator supports synchronization only on dependent memory accesses within static traces. In reality, the execution pattern of a parallel program may vary depending on the way in which the program executes. That is, the very changes which we want to test could affect the execution of our test programs.

In order for the simulation timings to be valid, the execution traces being run on the simulator must be able to change dynamically. A simple way to achieve this might be to divide execution traces into separate work blocks and distribute these blocks to the simulated processor cores as each core completes its previous block. A more complex, but more complete, solution would be to integrate a software simulator into the architectural simulator and base the execution timings within the software simulator on the architectural simulation. This would effectively allow us to run real software directly on the architectural simulation.

4 Research Plan

Our goal by the end of this project is twofold. First, we want to be able to measure the performance of parallel applications and benchmarks running on a chip multiprocessor using a bufferless interconnect network. Second, we want to compare the performance of different cache coherence protocols, especially between those that do and do not use multicast.

In order to achieve this goal, we will need to follow roughly the following steps:

1. Obtain the existing simulator code and get the traces originally used to test BLESS running.
2. Get a working baseline implementation of a cache coherence protocol.
3. Extend the simulator to support execution-driven parallel traces.
4. Implement a multicast-based cache coherence algorithm on top of BLESS.
5. Find and test potential improvements to multicast and BLESS.

4.1 Milestone 1

By this point, we should have step 1 completed, and ideally have step 2 completed as well. Since the BLESS simulator does not currently have a fully working implementation of cache coherency, step 2 could take additional time. We should also have a good idea of how we would like to go about completing steps 3 and 4, which can be done in parallel.

4.2 Milestone 2

By this point, we should have steps 2 and 3 completed and be focusing on steps 4 and 5. Work on step 3 could potentially continue if we choose to implement dynamic execution based on the architectural simulation.

4.3 Alternatives

If experimental data suggests that BLESS may not benefit from multicast, we can instead focus on improving other aspects of the protocol. For instance, there may be some benefit to adding a small amount of buffering to BLESS while maintaining the same basic routing mechanism. Alternatively, we could examine the possibility of critical thread prioritization.

References

[1] Abhishek Bhattacharjee and Margaret Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 290–301, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-526-0. doi: 10.1145/1555754.1555792.

[2] Natalie D. Enright Jerger, Li-Shiuan Peh, and Mikko H. Lipasti. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. In MICRO 41: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, pages 35–46, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-1-4244-2836-6. doi: 10.1109/MICRO.2008.4771777.

[3] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti. Virtual circuit tree multicasting: A case for on-chip hardware multicast support. In ISCA ’08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 229–240, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3174-8. doi: 10.1109/ISCA.2008.12.

[4] Thomas Moscibroda and Onur Mutlu. A case for bufferless routing in on-chip networks. In ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 196–207, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-526-0. doi: 10.1145/1555754.1555781.