Ali Shafiee Ardestani

Contact Information
School of Computing, University of Utah
50 S. Central Campus Dr., Rm 3190
Salt Lake City, UT 84112
(385) 252-7895
shafiee@cs.utah.edu
http://www.cs.utah.edu/~shafiee

Research Interests
Computer Architecture: Memory Hierarchies, DRAM Organization, Compression-Based Main Memory, Network-on-Chip

Education
University of Utah, Salt Lake City, USA
Ph.D. student in Computer Engineering, January 2012 - Present
Advisor: Prof. Rajeev Balasubramonian

Sharif University of Technology, Tehran, Iran
M.S. in Computer Engineering, August 2007 - July 2010
Advisor: Prof. Hamid Sarbazi-Azad
B.S. in Computer Engineering, June 2002 - May 2007

Honors and Awards
• Rank 7 among nearly 5000 competitors in the nationwide entrance exam for the M.Sc. program in Artificial Intelligence (2007).
• Rank 5 among nearly 13000 competitors in the nationwide entrance exam for the M.Sc. program in Computer Architecture (2007).
• Rank 140 among nearly 500000 competitors in the nationwide university entrance exam in mathematics and physics (2002).
• Bronze medal in the final stage of the national Olympiad for mathematics (2001).

Publications
Submitted to ISCA 2016
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, Vivek Srikumar. (Acceptance Rate 17%)

Submitted to ISCA 2016
MemCAD: An Interconnect Exploratory Tool for Innovative Memories Beyond DDR4
Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, Vaishnav Srinivas. (Acceptance Rate 17%)

MICRO 2015
Avoiding Information Leakage in the Memory Controller with Fixed Service Policies
Ali Shafiee, Akhila Gundu, Manjunath Shevgoor, Rajeev Balasubramonian, Mohit Tiwari. (Acceptance Rate 22%)

HPCA 2014
Memzip: Exploiting Unconventional Benefits from Memory Compression
Ali Shafiee, Meysam Taassori, Rajeev Balasubramonian, Al Davis. (Acceptance Rate 26%)

JLPE 2012
Heterogeneous Interconnect for Low-Power Snoop-Based Chip Multiprocessors
Narges Shahidi, Ali Shafiee, Amirali Baniasadi. (Impact Factor 0.485)

DATE 2012
AFRA: A Low Cost High Performance Reliable Routing for 3D Mesh NoCs
Ali Shafiee, Sara Akbari, Mahmood Fathy, Reza Berangi. (Acceptance Rate 25%)

ICCAD 2011
Application-Aware Deadlock-Free Oblivious Routing Based on Extended Turn-Model
Ali Shafiee, Mahdy Zolghadr, Mohammad Arjomand, Hamid Sarbazi-Azad. (Acceptance Rate 27%)

ICCD 2011
A Morphable Phase Change Memory Architecture Considering Frequent Zero Values
Mohammad Arjomand, Amin Jadidi, Ali Shafiee, Hamid Sarbazi-Azad. (Acceptance Rate 25%)

ICCD 2010
Helia: Heterogeneous Interconnect for Low Resolution Cache Access in Snoop-Based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi. (Acceptance Rate 25%)

WEED 2010
Using Partial Tag Comparison in Low-Power Snoop-Based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi.

Patents
Six patents under review (with HP Enterprise).

Professional Experience
Caspian Co., Tehran, Iran, 2011-2012
Co-op Intern. Embedded programming for banking system machines.

HP Enterprise, Summer 2015
Internship.

Selected Research Projects

ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near-data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks.
This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated to each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8X, 5.5X, and 7.5X in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.

MemCAD: An Interconnect Exploratory Tool for Innovative Memories Beyond DDR4
Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products: buffer-on-board, LRDIMM, HMC, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions.
First, we design a tool, MemCAD, that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. Second, we introduce the cascaded channel and narrow channel architectures to serve as case studies for the MemCAD tool, and show the potential for benefit from re-organizing basic memory interconnects.

Avoiding Information Leakage in the Memory Controller with Fixed Service Policies
Trusted applications frequently execute in tandem with untrusted applications on personal devices and in cloud environments. Since these co-scheduled applications share hardware resources, the latencies encountered by the untrusted application betray information about whether the trusted applications are accessing shared resources or not. This work develops a comprehensive approach to eliminate timing channels in the memory controller that has two key elements: (i) We shape the memory access behavior of each thread so that it has an unchanging memory access pattern. (ii) We show how efficient memory access pipelines can be constructed to process the resulting memory accesses without introducing any resource conflicts. We mathematically show that the proposed system yields zero information leakage. We then show that various page mapping policies can impact the throughput of our secure memory system. We also introduce techniques to re-order requests from different threads to boost performance without leaking information.

Memzip: Exploiting Unconventional Benefits from Memory Compression (HPCA 2014)
Memory compression has been proposed and deployed in the past to grow the capacity of a memory system and reduce page fault rates. Compression also has secondary benefits: it can reduce energy and bandwidth demands.
However, prior mechanisms have been designed to focus on the capacity metric and they have not been optimized to reduce energy or bandwidth. Further, mechanisms that focus on the capacity metric also require complex logic to locate the requested data in memory. In this paper, we design a very simple compressed memory architecture that does not target the capacity metric. Instead, it focuses on complexity, energy, and bandwidth. It relies on rank subsetting and a careful placement of compressed data and metadata to achieve these benefits. Further, the space made available via compression is used to boost other metrics: the space can be used to implement stronger error correction codes or energy-efficient data encodings.

USIMM: the Utah Simulated Memory Module (TR-UUCS-12-002)
USIMM is a cycle-accurate DRAM simulator, developed at the Utah Arch lab for the 3rd JILP Workshop on Computer Architecture Competitions (Memory Scheduling Championship), held with ISCA 2012. Besides being used for the competition by groups from different universities and from the DRAM industry, the simulator has now been used by more than a handful of publications from various groups. In this project, I have added support for a number of features: rank subsetting, 3D memory+logic devices, and Buffer-on-Board designs.

Application-Aware Deadlock-Free Oblivious Routing Based on Extended Turn-Model (ICCAD 2011)
Programmable hardware is gaining popularity as it can keep pace with growing performance demands under the tight power budgets, design and test costs, and serious reliability concerns of future multiprocessor embedded systems. In line with this trend, the Network-on-Chip, as a potential bottleneck of future multi-cores, should also support programmability. Here, we address this issue in the design and implementation of a routing algorithm for two-dimensional meshes. To this end, we allocate paths based on the input traffic pattern, in parallel with customizing routing restrictions for deadlock freedom.
To achieve this, we propose the extended turn model (ETM), a novel parametric deadlock-free routing scheme for 2D meshes that generalizes prior turn-based routing methods (e.g., odd-even) with a great degree of freedom. This model facilitates the design of a Mixed-Integer Linear Programming (MILP) approach, which treats channel dependency turns as independent variables and decides both path allocation and routing restrictions. We solve this problem with a genetic algorithm and evaluate it using simulation experiments.

Heterogeneous Interconnect for Low-Power Snoop-Based Chip Multiprocessors (WEED 2010, ICCD 2010, JLPE 2012)
In this work we introduce Heterogeneous Interconnect for Low Resolution Cache Access (Helia). Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary activities in both interconnect and cache. This is achieved by using innovative snoop filtering mechanisms coupled with wire management techniques. Our optimizations rely on the observation that a high percentage of cache mismatches could be detected by utilizing a small but highly informative subset of the tag bits. Helia relies on the snoop controller to detect possible remote tag mismatches prior to tag array lookup. Power is reduced as a) our wire management techniques permit slow transmission of a subset of tag bits while tag mismatches are being detected and b) we avoid cache access for mismatches detected at the snoop controller.

Skills
Languages: C, C++, Shell Scripting.
System simulators: Simics, Booksim, SESC, Marss x86, SimpleScalar.
CAD Tools: Synopsys Design Compiler.

Professional Service
Reviewer: IEEE Transactions on Computers, 2012.
Reviewer: IEEE Computer Architecture Letters, 2015.
Reviewer: The Journal of Supercomputing, 2015.

Coursework
Graduate: Computer Architecture, Advanced Computer Architecture, Digital VLSI Design, Interconnection Networks, Fault Tolerant Systems, Many-Core Parallel Programming, Algorithm Design.

References
Dr. Rajeev Balasubramonian
School of Computing, University of Utah
rajeev@cs.utah.edu

Dr. Erik Brunvand
School of Computing, University of Utah
elb@cs.utah.edu

Dr. Vivek Srikumar
School of Computing, University of Utah
svivek@cs.utah.edu