18-740 Computer Architecture Fall 2010: Project Proposal

DATM: Data-Locality Aware Thread Migration

Shun-Ping Chiu, Heng-Tze Cheng, Song Zhao
Department of Electrical and Computer Engineering
Carnegie Mellon University
{shunpinc, hengtze, songzhao}@cmu.edu

1. PROBLEM DEFINITION

The chip-multiprocessor has become the mainstream computing platform, and four-core processors are already available in the consumer market. While computing power increases with the number of cores, the cache must also grow to bridge the processor-memory gap. However, as a shared resource, the cache cannot scale easily, since increasing its size introduces significant access delay. One solution is to distribute the cache among the processors and access each cache slice non-uniformly according to the cache-to-core distance. To improve on static non-uniform cache access (NUCA) [2, 3], dynamic-NUCA has been proposed to move frequently used data closer to the cores that need it. While dynamic-NUCA provides significant performance improvement, it often incurs an unexpected performance penalty in multithreaded applications when two threads on two distant cores share the same cache line. This leads to the ping-pong effect, where a cache line continuously bounces between two different cores.

In this proposal, we propose an alternative approach: data-locality aware thread migration (DATM). Instead of dynamically moving cache lines from core to core, we plan to perform cache-aware thread migration to improve data locality. By migrating only the architectural state to a nearby core, we expect a small migration cost and better flexibility compared to dynamic-NUCA. Furthermore, the ping-pong effect will no longer occur when running multithreaded applications, because the threads sharing the same cache line will reside on the same core. In addition, we will consider integrating a state-of-the-art network-on-chip interconnect to make the model more realistic.

2. RELATED WORK

The distributed cache and the idea of NUCA have been investigated in [2, 3, 7]. NUCA is designed for large integrated caches, which tend to be wire-delay dominated. Instead of uniform access latency to every cache block, NUCA allows blocks to have different access times according to their distance from the requesting core. This design benefits from the high efficiency of local accesses; without data locality, however, performance degrades dramatically due to the longer latencies. To improve data locality, dynamic-NUCA and data migration were proposed in [3]: by dynamically moving frequently accessed data to nearby locations, the access latency can be effectively decreased.

As for migrating the execution rather than the data, Khan et al. [1] recently proposed the execution migration machine (EM2). Rather than moving data to the core that needs it, the EM2 architecture always moves execution to the core where the required data resides. However, the evaluation was based on several assumptions that may not hold in realistic settings. First, it assumes a high-bandwidth interconnect in which all messages experience a fixed low latency, which is impractical in real systems. Furthermore, the performance was compared only against a directory-based cache-coherent architecture, not against other state-of-the-art approaches.

Chakraborty et al. [8] also proposed splitting thread code into fragments; on each processing core, similar fragments belonging to different threads are grouped and executed together. This approach can improve code locality, but it raises a load-balancing issue across processors, since the finish time of a thread is determined by its slowest fragment. Furthermore, the cost of reassembling the thread must be considered.
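The ping-pong effect that motivates this proposal can be illustrated with a toy model. This is not our simulator; the function, the "migrate line to the last accessor's bank" policy, and the access traces below are purely illustrative assumptions.

```python
# Toy model of the ping-pong effect under a dynamic-NUCA-style policy that
# migrates a cache line to the bank of whichever core last accessed it.
# All names and parameters are illustrative, not part of any real simulator.

def count_line_migrations(access_trace, home_bank=0):
    """Count how often the line moves when every access by a non-local
    core pulls the line into that core's bank."""
    location = home_bank
    migrations = 0
    for core in access_trace:
        if core != location:
            migrations += 1   # line is moved to the accessor's bank
            location = core
    return migrations

# Two threads on distant cores (0 and 1) alternately touching one line:
shared_trace = [0, 1] * 8                     # 16 strictly interleaved accesses
print(count_line_migrations(shared_trace))    # prints 15: almost every access
                                              # bounces the line (ping-pong)

# After co-locating the two threads on core 0 (as DATM would), the same
# number of accesses causes no further line movement:
print(count_line_migrations([0] * 16))        # prints 0
```

The point of the sketch: under interleaved sharing, the per-line migration cost grows with the number of accesses, while co-locating the sharing threads pays a one-time cost and removes the line movement entirely.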
In [8], specific data structures are needed to store the mapping between fragment address ranges and processing node numbers, and these must be updated at run time. In addition, Hsieh et al. [9] proposed a compile-time computation migration mechanism, which migrates part of a thread to the cores close to the cache containing its data and can significantly reduce the overhead of remote memory accesses. The drawback of this approach is that the migration decision is made statically at compile time, so recompilation is required every time a new program is executed on the system.

3. PROPOSED METHODS

We plan to start with an experiment running a multithreaded application on a multi-core platform. Each core contains a slice of the shared cache and connects to the others through a network-on-chip architecture. Under these assumptions, we will characterize the cache access behavior under NUCA and dynamic-NUCA. Based on the application's attributes, we will then experiment with thread migration and evaluate its potential cost. Finally, we will design a hybrid algorithm that dynamically selects between thread migration and the cache line migration used in dynamic-NUCA.

4. EXPERIMENT METHODOLOGY

We plan to use the SimFlex [4] simulator and the DBmbench database microbenchmarks. Other multithreaded benchmarks such as SPLASH-2 [5] and PARSEC [6] are also considered as potential options. We plan to conduct experiments on a 36-core system with a mesh NoC interconnect. The performance of our approach will be evaluated against state-of-the-art NUCA designs.

5. RESEARCH PLAN

Goal: Achieve data-locality aware thread migration. Develop an algorithm that dynamically selects between thread migration and cache line migration.

Milestone 1:
• Set up simulation tools.
• Finalize the architectural assumptions.
• Validate the problem of the ping-pong effect in cache line migration.

Milestone 2:
• Design the algorithm for thread migration.
• Explore design tradeoffs through experiments.
6. REFERENCES

[1] O. Khan, M. Lis, S. Devadas, "Instruction-Level Execution Migration," MIT Technical Report MIT-CSAIL-TR-2010-019, April 2010.
[2] N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki, "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ACM SIGARCH Computer Architecture News, Vol. 37, Issue 3, Jun. 2009.
[3] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, S. W. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing," IEEE Transactions on Parallel and Distributed Systems, pp. 1028-1040, Aug. 2007.
[4] SimFlex. [http://parsa.epfl.ch/simflex/]
[5] S. C. Woo et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proc. International Symposium on Computer Architecture, 1995.
[6] C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," in Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, 2008.
[7] C. Kim et al., "Nonuniform Cache Architectures for Wire-Delay Dominated On-Chip Caches," in Proc. MICRO, 2003.
[8] K. Chakraborty et al., "Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-Fly," in Proc. Architectural Support for Programming Languages and Operating Systems, 2006.
[9] W. C. Hsieh et al., "Computation Migration: Enhancing Locality for Distributed-Memory Parallel Systems," in Proc. ACM Symp. Principles and Practice of Parallel Programming, 1993.