15-740 Project Proposal: A Micro-Architecture Level Study of FAWN Architecture

Bin Fan (binfan@cs.cmu.edu), Lin Xiao (lxiao@cs.cmu.edu), Prashant Kashinkunti (pkashink@andrew.cmu.edu)

September 27, 2010

1 Project Description

FAWN (Fast Array of Wimpy Nodes) [6, 1] is a scalable and energy-efficient cluster architecture for data-intensive computing. A FAWN cluster consists of a large number of "wimpy" nodes, each with an energy-efficient processor and a small amount of flash memory, that serve workloads such as key-value lookups or MapReduce jobs [3]. Research on FAWN to date has focused mainly on system-level optimizations, so in this project we plan to study FAWN at the micro-architecture level. The goal is a better understanding of how to design FAWN-like architectures: which mechanisms matter most for delivering performance-power efficiency, and what the potential trade-offs are.

Specifically, we target the key-value lookup workload on the FAWN architecture. This workload is traditionally I/O-intensive and bounded by the storage (flash, in FAWN) I/O speed. However, the latest FAWN implementation uses techniques such as Huffman-coded prefix trees and cuckoo hashing on the query path to reduce the memory consumption of the index structures, and these techniques demand more CPU performance. We therefore expect that an optimized micro-architecture can produce observable differences in both performance and power consumption.
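To make concrete why these index structures shift work onto the CPU, the sketch below shows a minimal two-choice cuckoo hash lookup in C. It is only an illustration under our own simplifying assumptions (the table size, bucket layout, and hash mixing are hypothetical), not the actual FAWN-KV index code.

/*
 * Minimal sketch of a 2-choice cuckoo hash lookup (illustrative only).
 * Table size, bucket layout, and hash mixing are simplifying assumptions,
 * not the actual FAWN-KV index format.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_BUCKETS      (1u << 16)   /* assumed table size (power of two) */
#define SLOTS_PER_BUCKET 4            /* assumed slots per bucket */

struct slot {
    uint64_t key;                     /* stored key (0 means empty here) */
    uint64_t value;                   /* e.g., offset of the object on flash */
};

struct bucket {
    struct slot slots[SLOTS_PER_BUCKET];
};

static struct bucket table[NUM_BUCKETS];

/* Two cheap multiplicative hashes; real code would use stronger mixing. */
static uint32_t hash1(uint64_t key)
{
    return (uint32_t)(key * 0x9E3779B97F4A7C15ULL) & (NUM_BUCKETS - 1);
}
static uint32_t hash2(uint64_t key)
{
    return (uint32_t)((key * 0xC2B2AE3D27D4EB4FULL) >> 32) & (NUM_BUCKETS - 1);
}

/* A lookup computes two hashes and probes at most two buckets. */
static bool cuckoo_lookup(uint64_t key, uint64_t *value_out)
{
    uint32_t candidates[2] = { hash1(key), hash2(key) };
    for (int i = 0; i < 2; i++) {
        struct bucket *bkt = &table[candidates[i]];
        for (int s = 0; s < SLOTS_PER_BUCKET; s++) {
            if (bkt->slots[s].key == key) {
                *value_out = bkt->slots[s].value;
                return true;
            }
        }
    }
    return false;
}

int main(void)
{
    /* Manually place one key so the lookup path can be exercised. */
    uint64_t key = 42, value = 0;
    table[hash1(key)].slots[0].key = key;
    table[hash1(key)].slots[0].value = 4096;   /* e.g., a flash offset */

    if (cuckoo_lookup(key, &value))
        printf("key %llu -> offset %llu\n",
               (unsigned long long)key, (unsigned long long)value);
    return 0;
}

Each lookup is a few hash computations plus at most two bucket probes that are unlikely to hit in the cache, which is exactly the kind of instruction mix and memory behavior we want to characterize at the micro-architecture level.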
2 Related Work

The Gordon project [5] described a flash-based system architecture for massively parallel, data-centric computing. It analyzed the Gordon design space and its trade-offs mainly by varying the CPU and by replacing disk with flash memory, but says little about micro-architecture level optimization. Similarly, the FAWN papers [6, 1] focus on how to combine existing hardware and system architectures to reduce power consumption. In contrast to Gordon and FAWN, there is prior work that studies a specific workload from the micro-architecture's point of view: the performance and power efficiency of a web-search workload has been studied on both server-class (Xeon) and mobile-class (Atom) micro-architectures [4], concluding that the small-core design is about 5x more efficient in terms of power.

3 Design

In this project, we aim at the following specific problems:

• Memory is crucial for high query throughput, but it also has high power consumption. We plan to study performance and power consumption across a range of memory configurations to find a point that balances this trade-off well.

• The cache also has a large impact on throughput. We plan to study cache hit ratios and the different types of L2 cache accesses to see how well the cache is utilized. For the key-value workload in particular, we may be able to optimize the cache replacement policy.

• Microarchitectural events during key-value lookups, such as the mix of instruction types and their performance, and the fraction of execution time spent stalled for different reasons. The purpose is to determine whether the data path is fully utilized and to find the bottleneck. We will also try to improve the micro-architecture, or redesign the system to benefit more from the current architecture.

4 Experimental Methodology

To study performance and power efficiency under different micro-architecture configurations, we plan to run simulations on Wattch [2]. Wattch is a simulator for analyzing and optimizing power dissipation at the architecture level. It integrates power models for common micro-architectural structures with a performance simulator, so that power estimates are produced alongside performance estimates.

5 Research Plan

Goals

• 100% goal: test different micro-architectures on the simulator with our targeted workloads, understand the design trade-offs, and look for a good balance between throughput and energy saving.

• 75% goal: test the current FAWN micro-architecture on the simulator with our targeted workloads, find bottlenecks in current FAWN, and give suggestions.

• 125% goal: test different micro-architectures on the simulator and on a real deployment, with multiple typical data-intensive computing workloads, and understand the implications of the workloads for the design.

Milestones

• Milestone 1: get the simulator running; generate different micro-architecture configurations.

• Milestone 2: test the targeted workload with one or more micro-architectures; finish preliminary measurements.

References

[1] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. FAWN: A fast array of wimpy nodes. In SOSP 2009.

[2] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In ISCA 2000.

[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 2004.

[4] Vijay Janapa Reddi, Benjamin Lee, Trishul Chilimbi, and Kushagra Vaid. Web search using small cores: Quantifying the price of efficiency. Technical report, 2009.

[5] Adrian M. Caulfield, Laura M. Grupp, and Steven Swanson. Gordon: Using flash memory to build fast, power-efficient clusters for data-intensive applications. In ASPLOS 2009.

[6] Vijay Vasudevan, David Andersen, Michael Kaminsky, Lawrence Tan, Jason Franklin, and Iulian Moraru. Energy-efficient cluster computing with FAWN: Workloads and implications. In e-Energy 2010.