Parallel Applications and Multicast for Bufferless Interconnect Networks

Alfred Barnat, Benson Tsai, Javier Novales

1 Problem

1.1 Definition

Our group has chosen to work on the topic “Bufferless Interconnect Networks – Parallel Applications”. We would like to expand on the work by Moscibroda and Mutlu on the BLESS protocol for bufferless routing in on-chip networks [4]. We have identified two ways in which we might extend this protocol: implementing support for broadcast and/or multicast traffic, or exploring ways to prioritize among traffic generated by multiple threads within one application.

1.2 Motivation

The main limitation of existing work on BLESS is a lack of thorough testing with parallel multithreaded applications. Existing performance measurements have been limited to running fully predetermined independent traces on multiple CPU cores. Since then, Chris Fallen has added a nearly complete cache coherency implementation to the simulator used for the original testing, so that parallel traces with read-write dependencies can be tested.

Currently, the class of cache coherency algorithms that can be implemented on top of the BLESS protocol is limited by the lack of broadcast and multicast support. It is our hope that by adding support for some combination of broadcast and multicast traffic, we may be able to implement more efficient coherency protocols that improve cache latency or decrease overall memory-related bandwidth use on the network.

Another aspect of multithreaded execution not covered in previous BLESS research is whether overall application performance can be improved through the intelligent prioritization of requests from critical threads. In the case of an application with one dispatch thread and many processing threads, for example, overall performance may improve if packets from the core running the dispatch thread are given a higher priority than those originating at the cores running processing threads.

2 Related Work

The work in this project will be largely based on previous work on the BLESS protocol [4]. Since the publication of the 2009 paper on the topic, the simulator used in testing BLESS has been extended to support cache coherency and can now simulate a subset of parallel applications. Specifically, the simulator is able to synchronize multithreaded application traces on read-write dependencies.

Virtual Tree Coherence (VTC) offers one potential method to improve cache latency using multicast in an on-chip network. It is based on the Virtual Circuit Tree Multicasting (VCTM) protocol, which utilizes dynamically generated routing trees within an on-chip network to simulate ordered multicast packets within a subset of nodes [3]. VTC offers benefits over both traditional broadcast-based techniques and directory-based techniques by allowing coherence communication to remain within the working set of nodes using a given memory location [2].

In the area of request prioritization, Bhattacharjee and Martonosi have had success in developing thread criticality predictors in order to detect which processing cores are currently running critical threads, and prioritize requests originating at those cores [1].
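
To make the routing-tree idea behind multicast more concrete, the following minimal sketch builds a multicast tree on a 2D mesh as the union of dimension-ordered (XY) unicast paths from one source to each destination, and compares the number of link traversals against sending a separate unicast packet to every destination. The mesh topology, the use of XY routing, and the names `xy_path` and `multicast_tree` are assumptions made purely for illustration. This is not the tree construction described in [3], and it ignores the deflections that BLESS would introduce.

```python
def xy_path(src, dst):
    """Return the directed links (hop pairs) on the XY route from src to dst."""
    (x, y), (dx, dy) = src, dst
    links = []
    # Travel along the X dimension first.
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    # Then travel along the Y dimension.
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def multicast_tree(src, dests):
    """Union of XY paths: links shared by several destinations appear once."""
    tree = set()
    for d in dests:
        tree.update(xy_path(src, d))
    return tree


# Hypothetical example: one sender, three receivers in the same mesh column.
src = (0, 0)
dests = [(2, 0), (2, 1), (2, 2)]
unicast_links = sum(len(xy_path(src, d)) for d in dests)
tree_links = len(multicast_tree(src, dests))
print(unicast_links, tree_links)  # prints "9 4"
```

In this example the three unicast packets cross 9 links in total, while a single packet replicated along the tree crosses 4, which is the kind of bandwidth saving that motivates in-network multicast.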
3 Solution

3.1 Ideas

We would like to focus our efforts on those use cases that may benefit from the implementation of an efficient multicast and/or broadcast protocol on top of BLESS. As a first step, we could look at implementing VCTM on top of BLESS in order to enable the use of VTC for cache coherency. Further steps would stem from a more complete investigation of existing work in the area of multicast within on-chip networks, especially in relation to cache coherence, as well as our experience as we begin testing techniques.

3.2 Experimental Methodology

We will be making use of the existing BLESS simulator, and extending it to test our ideas. Aside from the protocol-specific additions, we will need to extend the simulator to support more complex parallel execution patterns. Currently, the simulator supports synchronization only on dependent memory accesses within static traces. In reality, the execution pattern of a parallel program may vary depending on the way in which the program executes. That is, the very changes which we want to test could affect the execution of our test programs.

In order for the simulation timings to be valid, the execution traces being run on the simulator must be able to change dynamically. A simple way to achieve this might be to divide execution traces into separate work blocks and distribute these blocks to the simulated processor cores as they complete. A more complex, but more complete, solution would be to integrate a software simulator into the architectural simulator and base the execution timings within the software simulator on the architectural simulation. This would effectively allow us to run real software directly on the architectural simulation.

4 Research Plan

Our goal by the end of this project is twofold. First, we want to be able to measure the performance of parallel applications and benchmarks running on a chip multiprocessor using a bufferless interconnect network. Second, we want to compare the performance of different cache coherence protocols, especially between those that do and do not use multicast.

In order to achieve this goal, we will need to follow roughly the following steps:

1. Obtain the existing simulator code and get the traces originally used to test BLESS running.
2. Get a working baseline implementation of a cache coherence protocol.
3. Extend the simulator to support execution-driven parallel traces.
4. Implement a multicast-based cache coherence algorithm on top of BLESS.
5. Find and test potential improvements to multicast and BLESS.

4.1 Milestone 1

By this point, we should have step 1 completed, and ideally have step 2 completed as well. Since the BLESS simulator does not currently have a fully working implementation of cache coherency, step 2 could take additional time. We should also have a good idea of how we would like to go about completing steps 3 and 4, which can be done in parallel.

4.2 Milestone 2

By this point, we should have steps 2 and 3 completed and be focusing on steps 4 and 5. Work on step 3 could potentially continue if we choose to implement dynamic execution based on the architectural simulation.

4.3 Alternatives

If experimental data suggests that BLESS may not benefit from multicast, we can instead focus on improving other aspects of the protocol. For instance, there may be some benefit to adding a small amount of buffering to BLESS while maintaining the same basic routing mechanism. Alternatively, we could examine the possibility of critical thread prioritization.
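
The critical thread prioritization alternative could be prototyped as a change to the order in which a router arbitrates among competing flits. The sketch below is a minimal illustration, assuming an age-based (oldest-first) baseline ranking, a four-port mesh router, and a set of critical cores that is already known. The `Flit` fields, the `arbitrate` function, and the way a losing flit picks its deflection port are all hypothetical and are not taken from the BLESS simulator.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    src_core: int      # core that injected the flit
    inject_cycle: int  # smaller value means older flit
    wanted_port: str   # preferred output port: "N", "E", "S" or "W"


def arbitrate(flits, critical_cores, ports=("N", "E", "S", "W")):
    """Assign every flit an output port; losers are deflected, never buffered.

    Assumes len(flits) <= len(ports), as in a bufferless router where each
    incoming flit must leave on some output port in the same cycle.
    """
    # Rank flits from critical cores first, then oldest first.
    ranked = sorted(
        flits,
        key=lambda f: (f.src_core not in critical_cores, f.inject_cycle),
    )
    free = list(ports)
    assignment = []
    for f in ranked:
        # Take the preferred port if still free, otherwise deflect.
        port = f.wanted_port if f.wanted_port in free else free[0]
        free.remove(port)
        assignment.append((f, port))
    return assignment


# Hypothetical example: core 0 runs the dispatch (critical) thread.
flits = [Flit(1, 10, "E"), Flit(0, 50, "E"), Flit(2, 5, "N")]
for flit, port in arbitrate(flits, critical_cores={0}):
    print(flit.src_core, port)
```

With core 0 marked critical, its flit wins the contested east port even though it is the youngest, and the flit from core 1 is deflected. Passing an empty set of critical cores falls back to plain oldest-first ranking, which gives a baseline for comparison.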
References
[1] Abhishek Bhattacharjee and Margaret Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 290–301, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-526-0. doi: 10.1145/1555754.1555792.
[2] Natalie D. Enright Jerger, Li-Shiuan Peh, and Mikko H. Lipasti. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. In MICRO 41: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, pages 35–46, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-1-4244-2836-6. doi: 10.1109/MICRO.2008.4771777.
[3] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti. Virtual circuit tree multicasting: A case for on-chip hardware multicast support. In ISCA ’08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 229–240, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3174-8. doi: 10.1109/ISCA.2008.12.
[4] Thomas Moscibroda and Onur Mutlu. A case for bufferless routing in on-chip networks. In ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 196–207, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-526-0. doi: 10.1145/1555754.1555781.