apandya, cgopal, mdharkar

“SYNERGY” – A SHARED RESOURCE AWARE PRIORITIZATION ALGORITHM FOR CHIP MULTIPROCESSORS

Problem Statement: Recent multicore processors employ a different request prioritization policy at each shared resource individually; they do not take into account the interdependence between shared resources in processing a single request. We believe that developing a holistic policy that incorporates the knowledge acquired from each resource (core, cache, interconnect, memory controller) could improve system throughput and ensure fairness in resource allocation.

Related Work: The algorithms proposed in recent research define a prioritization policy based on a single shared resource. Most recent scheduling algorithms, such as ATLAS [2] and PAR-BS [3], are implemented at the memory controller and employ out-of-order scheduling to maximize row access locality and bank-level parallelism. The prioritization policy proposed by Das et al. [4] is application aware, distinguishing applications based on the stall-time criticality of their packets. Yuan et al. [1] reorder the memory request stream at the interconnect, in turn justifying the use of a simple, in-order DRAM scheduler; they consider row access locality when prioritizing memory access requests at the interconnect, but do not consider interconnect latencies. Kim et al. [2] make the prioritization decision based on a thread ranking given by least attained service, to maximize system throughput, together with a long time quantum to provide scalability. Mutlu et al. [3] propose to exploit the bank-level parallelism of memory accesses while providing fairness to individual threads by reducing the memory-related stall time experienced by each thread. While these policies have their own benefits, we believe that coordinating the decisions made at each shared resource would enhance overall system performance.
Resource          – Prioritization parameter
Router            – 1) Amount of deflection in the network; 2) turnaround time of the packets (using information received from the core)
Memory Controller – Row buffer locality (using information received from the router)

How to solve the problem: We consider a mesh network as our base topology to incorporate shared resource awareness. While deciding priorities at the NoC and at the memory controller, we iteratively converge to a single decision for prioritizing each memory request. The metrics we take into account at the interconnect (NoC) are the deflection count of the packets and the turnaround time of the previous packet at that particular core. Since a mesh network has certain congestion points, considering the deflection that each packet has suffered in the network gives us a reasonable prioritization metric. In case of a conflict with another request, the router looks at the turnaround time passed on by the requesting core. This turnaround time is calculated over a window of multiple instructions. When each NoC router sends a packet intended for a particular destination, it uses a single virtual channel, thus reserving specific channels for each memory controller. This static virtual channel allocation (SVCA) helps in reordering the packets based on the calculated priorities. On reaching the memory controller, the controller considers the priorities sent by the router and incorporates row buffer locality to further improve memory access. To optimize the performance of the prioritization policy, we propose to look at prefetching in NoCs and data migration in the shared cache.

Description of experimental setup: We would use a mesh topology for the simulations. We intend to use the BLESS simulator for implementing the above-mentioned prioritization algorithm. The benchmark suite we intend to use is SPEC CPU 2000.
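The two-stage prioritization described above can be sketched as follows. This is a minimal illustrative model, not the actual implementation: the `Request` fields, the weighting of deflection count over turnaround time, and the tie-breaking on row-buffer hits are all our assumptions for exposition.

```python
# Hypothetical sketch of the proposed two-stage prioritization.
# Field names and the weighting scheme are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    core_id: int
    row: int                  # DRAM row targeted by the request
    deflections: int = 0      # times the packet was deflected in the mesh
    turnaround_time: int = 0  # turnaround time reported by the requesting core
    router_priority: int = 0

def router_priority(req: Request) -> int:
    """Stage 1 (NoC): favor packets that have suffered more deflections;
    resolve conflicts using the core-supplied turnaround time."""
    return req.deflections * 1000 + req.turnaround_time

def schedule_at_controller(queue: list[Request], open_row: int) -> Request:
    """Stage 2 (memory controller): honor the router-assigned priority,
    then prefer requests hitting the open row (row buffer locality)."""
    for req in queue:
        req.router_priority = router_priority(req)
    return max(queue, key=lambda r: (r.router_priority, r.row == open_row))
```

A usage example: among two requests with equal router priority, the one that hits the controller's open row is scheduled first, reflecting how row buffer locality refines, rather than overrides, the router's decision.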
Brief research plan: The goal is to come up with a holistic scheduling algorithm that exploits the knowledge of each resource.
1) Milestone 1: Set up the simulation environment required for our experiments and implement the baseline on this environment.
2) Milestone 2: Implement the router prioritization policy considering deflection and turnaround time, and incorporate row buffer locality as a priority metric at the memory controller scheduler.
3) Milestone 3: Evaluate the performance of our algorithm, possibly defining new metrics for the priority decisions. If needed, we would look at prefetching in the NoCs and data migration to optimize the effectiveness of our algorithm.

References:
[1] Yuan, Bakhoda, Aamodt, “Complexity Effective Memory Access Scheduling for Many-core Accelerator Architectures,” MICRO 2009.
[2] Kim, Han, Mutlu, Harchol-Balter, “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” MICRO 2009.
[3] Mutlu, Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,” ISCA 2008.
[4] Das, Mutlu, Moscibroda, Das, “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009.