18-740 Computer Architecture Fall 2010: Project Proposal
DATM: Data-Locality Aware Thread Migration
Shun-Ping Chiu, Heng-Tze Cheng, Song Zhao
Department of Electrical and Computer Engineering
Carnegie Mellon University
{shunpinc, hengtze, songzhao}@cmu.edu
1. PROBLEM DEFINITION
The chip multiprocessor has become the mainstream computing platform; four-core processors are already available in the consumer market. While computing power increases with the number of cores, the cache must also grow to bridge the processor-memory gap. As a shared resource, however, the cache does not scale easily, since increasing its size introduces significant access delay. One solution is to distribute the cache among the processors and access each slice non-uniformly according to the cache-to-thread distance. To improve the performance of static non-uniform cache access (NUCA) [2, 3], dynamic-NUCA was proposed to move frequently used data closer to the cores that need it. While dynamic-NUCA provides a significant performance improvement, it often incurs an unexpected penalty in multithreaded applications when two threads on two distant cores share the same cache line. This leads to the ping-pong effect, where a cache line continuously bounces between two different cores.
In this proposal, we propose an alternative approach: data-locality aware thread migration (DATM). Instead of dynamically moving cache lines from core to core, we plan to perform cache-aware thread migration to improve data locality. By migrating only the architectural state to a nearby core, we expect a small migration cost and better flexibility compared to dynamic-NUCA. Furthermore, the ping-pong effect no longer occurs in multithreaded applications, because the threads sharing a cache line now reside on the same core. In addition, we will consider integrating a state-of-the-art network-on-chip interconnect to make the model more realistic.
2. RELATED WORK
Distributed caches and the idea of NUCA have been investigated in [2, 3, 7]. NUCA is designed to serve large integrated caches, whose access time tends to be wire-delay dominated. Instead of a uniform access time for every cache block, NUCA allows blocks to have different access latencies according to their distance from the requesting core. This design benefits from highly efficient local accesses; without data locality, however, performance degrades dramatically due to the longer latencies. To improve data locality, dynamic-NUCA with data migration was proposed in [3]: by dynamically moving frequently accessed data to nearby locations, the access latency can be effectively decreased.
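To make the latency argument concrete, the ping-pong effect can be illustrated with a toy cost model (all cycle counts are hypothetical placeholders, not simulator measurements):

```python
# Toy model of the ping-pong effect (hypothetical cycle costs).
LOCAL_HIT = 3        # cycles for a hit in the local cache bank
LINE_MIGRATION = 50  # cycles to move a cache line between distant banks

def dynamic_nuca_cost(access_pattern):
    """Cache-line migration: the line follows the last accessor,
    so every switch between cores pays a line migration."""
    total, line_at = 0, access_pattern[0]
    for core in access_pattern:
        if core == line_at:
            total += LOCAL_HIT
        else:
            total += LINE_MIGRATION + LOCAL_HIT  # drag the line over, then hit
            line_at = core
    return total

def thread_migration_cost(access_pattern, thread_migration=200):
    """Thread migration: pay one (larger) migration cost up front,
    after which every access to the shared line is local."""
    return thread_migration + LOCAL_HIT * len(access_pattern)

# Two threads on distant cores touching the same line alternately.
pattern = [0, 1] * 50
print(dynamic_nuca_cost(pattern))    # the line migrates on almost every access
print(thread_migration_cost(pattern))
```

Even with a per-thread migration cost several times larger than a line migration, the one-time cost is quickly amortized once the sharing pattern alternates.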
As for migrating execution rather than data, Khan et al. [1] recently proposed the execution migration machine (EM2). Rather than moving data to the core that needs it, the EM2 architecture always moves the execution to the core where the required data resides. However, the evaluation was based on several assumptions that may not hold in realistic settings. First, it assumes a high-bandwidth interconnect in which all messages experience a fixed low latency, which is impractical in real systems. Furthermore, the performance was compared only against a directory-based cache-coherent architecture, not against other state-of-the-art approaches.
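The contrast can be sketched as follows (our caricature of the EM2 policy, with hypothetical latency parameters; the fixed-latency assumption is exactly the part we question):

```python
def em2_access(core, data_home, fixed_latency=10):
    """EM2 policy sketch: execution always moves to the data's home core.
    The evaluation in [1] charges a FIXED latency per migration,
    independent of how far apart the two cores are."""
    if core != data_home:
        return data_home, fixed_latency  # migrate the execution context
    return core, 0

def mesh_latency(src, dst, width=6, per_hop=4):
    """On a real mesh NoC, latency grows with hop count (X-Y routing):
    Manhattan distance between the two cores times a per-hop delay."""
    hops = abs(src % width - dst % width) + abs(src // width - dst // width)
    return hops * per_hop

# Under the fixed-latency model, migrating from core 0 to core 35 costs the
# same as migrating to core 1; on a 6x6 mesh it is 10 hops versus 1 hop.
print(mesh_latency(0, 35), mesh_latency(0, 1))  # 40 4
```

Any migration policy evaluated on a mesh interconnect therefore has to account for distance-dependent latency, which is one of the effects our experiments will model.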
Chakraborty et al. [8] also proposed splitting thread code into fragments. On each processing core, similar fragments belonging to different threads are grouped and executed together. This approach can improve code locality, but it raises a load-balancing issue among processors, since the actual finish time of a thread is determined by its slowest fragment. Furthermore, the cost of reassembling the thread must be considered: dedicated data structures are needed to store the mapping between fragment address ranges and processing node numbers, and this mapping must be updated at run time.
In addition, Hsieh et al. [9] proposed a compile-time computation migration mechanism, which migrates parts of a thread to the cores close to the cache containing their data and can significantly reduce the overhead of remote memory accesses. The drawback of this approach is that the migration decision is made statically at compile time, so recompilation is required whenever a new program is executed on the system.
3. PROPOSED METHODS
We plan to start by experimenting with a multithreaded application on a multi-core platform in which each core holds a slice of the shared cache and the cores are connected through a network-on-chip. Under these assumptions, we will characterize cache access behavior under NUCA and dynamic-NUCA. Based on the application's attributes, we will then experiment with thread migration and evaluate its potential cost. Finally, we will design a hybrid algorithm that dynamically selects between thread migration and the cache-line migration used in dynamic-NUCA.
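A first cut of the hybrid policy might compare estimated costs per sharing episode (all parameters and thresholds here are hypothetical; the actual policy will be tuned experimentally):

```python
def choose_migration(sharers, expected_accesses, hop_distance,
                     line_move_cost=50, thread_move_cost=200):
    """Hybrid policy sketch: pick cache-line or thread migration.

    sharers: number of distinct cores touching the line
    expected_accesses: predicted accesses before the sharing pattern changes
    hop_distance: mesh hops between the sharing cores
    """
    if sharers <= 1:
        # Private data: dynamic-NUCA line migration is cheap and sufficient.
        return "line"
    # With multiple sharers, line migration risks ping-ponging: in the
    # worst case the line moves on every access by a different core.
    line_cost = expected_accesses * line_move_cost
    # Thread migration is a one-time cost, after which accesses are local;
    # the hop term models distance-dependent transfer of architectural state.
    thread_cost = thread_move_cost + hop_distance
    return "thread" if thread_cost < line_cost else "line"

print(choose_migration(sharers=2, expected_accesses=100, hop_distance=8))  # thread
print(choose_migration(sharers=2, expected_accesses=3, hop_distance=8))    # line
```

The open design questions are how to predict `expected_accesses` (e.g., from sharing history) and how to account for the pipeline and cache-warmup costs of moving a thread, which this sketch folds into a single constant.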
4. EXPERIMENT METHODOLOGY
We plan to use the SimFlex [4] simulator and the DBmbench database microbenchmarks. Other multithreaded benchmarks such as SPLASH-2 [5] and PARSEC [6] are also potential options. We plan to conduct experiments on a 36-core system with a mesh NoC interconnect and to evaluate the performance of our approach against state-of-the-art NUCA designs.
5. RESEARCH PLAN
Goal: Achieve data-locality aware thread migration. Develop an algorithm that dynamically selects between thread migration and cache-line migration.
Milestone 1:
• Set up simulation tools.
• Finalize the architectural assumptions.
• Reproduce and validate the ping-pong effect under cache-line migration.
Milestone 2:
• Design the algorithm for thread migration.
• Explore design tradeoffs through experiments.
6. REFERENCES
[1] O. Khan, M. Lis, S. Devadas, “Instruction-Level Execution Migration,” MIT Technical Report MIT-CSAIL-TR-2010-019, Apr. 2010.
[2] N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki, “Reactive NUCA: near-optimal block placement and replication in distributed caches,” ACM SIGARCH Computer Architecture News, vol. 37, no. 3, Jun. 2009.
[3] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, S. W. Keckler, “A NUCA Substrate for Flexible CMP Cache Sharing,” IEEE Transactions on Parallel and Distributed Systems, pp. 1028-1040, Aug. 2007.
[4] SimFlex. [http://parsa.epfl.ch/simflex/]
[5] S. C. Woo et al., “The SPLASH-2 programs: characterization and methodological considerations,” in Proc. International Symposium on Computer Architecture, 1995.
[6] C. Bienia et al., “The PARSEC benchmark suite: characterization and architectural implications,” in Proc. Int’l Conf. on Parallel Architectures and Compilation Techniques, 2008.
[7] C. Kim et al., “Nonuniform cache architectures for wire-delay dominated on-chip caches,” in Proc. MICRO, 2003.
[8] K. Chakraborty et al., “Computation spreading: employing hardware migration to specialize CMP cores on-the-fly,” in Proc. Architectural Support for Programming Languages and Operating Systems, 2006.
[9] W. C. Hsieh et al., “Computation migration: enhancing locality for distributed-memory parallel systems,” in Proc. ACM Symp. on Principles and Practice of Parallel Programming, 1993.