Ali Shafiee Ardestani

Contact Information
School of Computing, University of Utah
50 S. Central Campus Dr. Rm 3190
Salt Lake City, UT 84112
(385) 252-7895
shafiee@cs.utah.edu
http://www.cs.utah.edu/~shafiee
Research Interests
Computer Architecture: Memory Hierarchies, DRAM Organization, Compression-based Main Memory,
Network-on-Chip
Education
University of Utah, Salt Lake City, USA
Ph.D. student in Computer Engineering, January 2012 - Present
Advisor: Prof. Rajeev Balasubramonian
Sharif University of Technology, Tehran, Iran
M.S. in Computer Engineering, August 2007 - July 2010
Advisor: Prof. Hamid Sarbazi-Azad
B.S. in Computer Engineering, June 2002 - May 2007
Honors and Awards
• Ranked 7th among nearly 5,000 competitors in the nationwide entrance exam for the M.Sc. program
in Artificial Intelligence (2007).
• Ranked 5th among nearly 13,000 competitors in the nationwide entrance exam for the M.Sc. program
in Computer Architecture (2007).
• Ranked 140th among nearly 500,000 competitors in the nationwide university entrance exam
in mathematics and physics (2002).
• Bronze medal in the final stage of the national Olympiad in mathematics (2001).
Publications
submitted to ISCA 2016 ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, Vivek Srikumar. (Acceptance Rate 17%)
submitted to ISCA 2016 MemCAD: An Interconnect Exploratory Tool for Innovative Memories Beyond DDR4
Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, Vaishnav Srinivas. (Acceptance Rate 17%)
MICRO 2015 Avoiding Information Leakage in the Memory Controller with Fixed Service Policies
Ali Shafiee, Akhila Gundu, Manjunath Shevgoor, Rajeev Balasubramonian, Mohit Tiwari. (Acceptance Rate 22%)
HPCA 2014 Memzip: Exploiting Unconventional Benefits from Memory Compression
Ali Shafiee, Meysam Taassori, Rajeev Balasubramonian, Al Davis. (Acceptance Rate 26%)
JLPE 2012 Heterogeneous Interconnect for Low-Power Snoop-Based Chip Multiprocessors
Narges Shahidi, Ali Shafiee, Amirali Baniasadi. (Impact Factor 0.485)
DATE 2012 AFRA: A Low Cost High Performance Reliable Routing for 3D Mesh NoCs
Ali Shafiee, Sara Akbari, Mahmood Fathy, Reza Berangi. (Acceptance Rate 25%)
ICCAD 2011 Application-Aware Deadlock-Free Oblivious Routing Based on Extended Turn-Model
Ali Shafiee, Mahdy Zolghadr, Mohammad Arjomand, Hamid Sarbazi-Azad. (Acceptance Rate 27%)
ICCD 2011 A Morphable Phase Change Memory Architecture Considering Frequent Zero Values
Mohammad Arjomand, Amin Jadidi, Ali Shafiee, Hamid Sarbazi-Azad. (Acceptance Rate 25%)
ICCD 2010 Helia: Heterogeneous Interconnect for Low Resolution Cache Access in Snoop-Based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi. (Acceptance Rate 25%)
WEED 2010 Using Partial Tag Comparison in Low-Power Snoop-Based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi.
Patents
Six patents under review (with HP Enterprise).
Professional Experience
Caspian Co., Tehran, Iran, Co-op Intern, 2011-2012
Embedded programming for banking system machine.
HP Enterprise, Intern, Summer 2015
Selected Research Projects
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars
A number of recent efforts have attempted to design accelerators for popular machine learning
algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs).
These algorithms typically involve a large number of multiply-accumulate (dot-product) operations.
A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural
functional unit performs all the digital arithmetic operations and receives input weights from adjacent
eDRAM banks.
This work explores an in-situ processing approach, where memristor crossbar arrays not only store
input weights, but are also used to perform dot-product operations in an analog manner. While the
use of crossbar memory as an analog dot-product engine is well known, no prior work has designed
or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the
following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for
each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii)
We define new data encoding techniques that are amenable to analog computations and that can
reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting
digital components required in an analog CNN accelerator and carry out a design space exploration
to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip.
On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of
14.8X, 5.5X, and 7.5X in throughput, energy, and computational density (respectively), relative to
the state-of-the-art DaDianNao architecture.
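The in-situ dot-product idea can be illustrated with a small software model. This is only a sketch under stated assumptions: the crossbar is modeled as an integer matrix of cell conductances, inputs are fed one bit per cycle (a bit-serial scheme chosen here for illustration, not necessarily the encoding used in ISAAC), and the ADC is treated as ideal. The function names are hypothetical.

```python
def crossbar_column_currents(voltages, conductances):
    """Model one analog step: the current on column j is sum_i V_i * G_ij."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


def bit_serial_dot_products(inputs, weights, input_bits=8):
    """Compute inputs . weights[:, j] for every column j.

    Assumption: each cycle applies one bit of every input as a 0/1 voltage,
    and the digitized column currents are shifted and accumulated.
    """
    acc = [0] * len(weights[0])
    for b in range(input_bits):
        bit_slice = [(x >> b) & 1 for x in inputs]             # one bit per row
        partial = crossbar_column_currents(bit_slice, weights)  # analog step
        acc = [a + (p << b) for a, p in zip(acc, partial)]      # shift-and-add
    return acc


inputs = [3, 5, 2]                   # non-negative integer activations
weights = [[1, 2], [0, 1], [4, 3]]   # rows = inputs, columns = output neurons
result = bit_serial_dot_products(inputs, weights)
print(result)
```

Feeding one input bit per cycle keeps each column current small, which is one way to limit the resolution required of the ADC, at the cost of more cycles per dot-product.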
MemCAD: An Interconnect Exploratory Tool for Innovative Memories Beyond DDR4
Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory
hierarchy, with the introduction of many new memory products – buffer-on-board, LRDIMM, HMC,
and NVMs to name a few. Given the plethora of choices, it is expected that different vendors will
adopt different strategies for their high-capacity memory systems. These strategies will likely differ
in their choice of interconnect and topology, with a significant fraction of memory energy being
dissipated in I/O and data movement. To make the case for memory interconnect specialization,
this paper makes three contributions.
First, we design a tool, MemCAD, that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models. Our analysis with the tool shows
that several design parameters have a significant impact on I/O power.
Second, we introduce the cascaded channel and narrow channel architectures to serve as case studies for the MemCAD tool, and show the potential for benefit from re-organizing basic memory
interconnects.
Avoiding Information Leakage in the Memory Controller with Fixed Service Policies (MICRO 2015)
Trusted applications frequently execute in tandem with untrusted applications on personal devices
and in cloud environments. Since these co-scheduled applications share hardware resources, the
latencies encountered by the untrusted application betray information about whether the trusted
applications are accessing shared resources or not. This work develops a comprehensive approach
to eliminate timing channels in the memory controller that has two key elements: (i) We shape
the memory access behavior of each thread so that it has an unchanging memory access pattern.
(ii) We show how efficient memory access pipelines can be constructed to process the resulting
memory accesses without introducing any resource conflicts. We mathematically show that the
proposed system yields zero information leakage. We then show that various page mapping policies
can impact the throughput of our secure memory system. We also introduce techniques to re-order
requests from different threads to boost performance without leaking information.
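The first element, shaping each thread's accesses into an unchanging pattern, can be sketched as a fixed round-robin slot schedule. This is a simplified illustration, not the paper's memory pipeline: the slot ownership rule, the use of dummy requests to fill idle slots, and the function names are assumptions made for the example.

```python
from collections import deque


def fixed_service_schedule(queues, num_slots):
    """Assign memory slots to threads in a fixed round-robin order.

    Assumption: a thread with nothing pending still consumes its slot
    (recorded here as a "dummy" access), so its demand never shifts the
    slots seen by other threads.
    """
    pending = {t: deque(reqs) for t, reqs in queues.items()}
    order = sorted(pending)
    timeline = []
    for slot in range(num_slots):
        owner = order[slot % len(order)]
        req = pending[owner].popleft() if pending[owner] else "dummy"
        timeline.append((slot, owner, req))
    return timeline


def service_slots(timeline, thread):
    """Slots in which the given thread's real requests were served."""
    return [s for s, owner, req in timeline if owner == thread and req != "dummy"]


# Thread B observes the same service times whether thread A is idle or busy.
idle_a = fixed_service_schedule({"A": [], "B": ["b0", "b1"]}, num_slots=6)
busy_a = fixed_service_schedule({"A": ["a0", "a1", "a2"], "B": ["b0", "b1"]}, num_slots=6)
print(service_slots(idle_a, "B"), service_slots(busy_a, "B"))
```

Because slot ownership depends only on the slot index, the latency one thread observes carries no information about the other thread's activity; the cost is bandwidth spent on idle slots.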
Memzip: Exploiting Unconventional Benefits from Memory Compression (HPCA 2014)
Memory compression has been proposed and deployed in the past to grow the capacity of a memory
system and reduce page fault rates. Compression also has secondary benefits: it can reduce energy
and bandwidth demands. However, prior mechanisms have been designed to focus on the capacity
metric and they have not been optimized to reduce energy or bandwidth. Further, mechanisms that
focus on the capacity metric also require complex logic to locate the requested data in memory.
In this paper, we design a simple compressed memory architecture that does not target
the capacity metric. Instead, it focuses on complexity, energy, and bandwidth. It relies on rank
subsetting and a careful placement of compressed data and metadata to achieve these benefits.
Further, the space made available via compression is used to boost other metrics: it can be
used to implement stronger error correction codes or energy-efficient data encodings.
USIMM: the Utah Simulated Memory Module (TR-UUCS-12-002)
USIMM is a cycle-accurate DRAM simulator, developed at the Utah Arch lab for the 3rd JILP
Workshop on Computer Architecture Competitions (Memory Scheduling Championship), held with
ISCA 2012. Besides being used for the competition by groups from different universities and from
the DRAM industry, the simulator has now been used by more than a handful of publications from
various groups. In this project, I have added support for a number of features: rank-subsetting, 3D
memory+logic devices, and Buffer-on-Board designs.
Application-Aware Deadlock-Free Oblivious Routing Based on Extended Turn-Model (ICCAD 2011)
Programmable hardware is gaining popularity because it can keep pace with growing performance
demands under the tight power budgets, design and test costs, and serious reliability concerns of
future multiprocessor embedded systems. In keeping with this trend, the Network-on-Chip, a potential
bottleneck of future multi-cores, should also support programmability. Here, we address this issue in
the design and implementation of a routing algorithm for two-dimensional meshes. To this end, we
allocate paths based on the input traffic pattern while simultaneously customizing routing
restrictions for deadlock freedom. To achieve this, we propose the extended turn model (ETM), a novel
parametric deadlock-free routing scheme for 2D meshes that generalizes prior turn-based routing
methods (e.g., odd-even) with a great degree of freedom. This model facilitates a Mixed-Integer
Linear Programming (MILP) formulation, which treats channel dependency turns as independent variables
and decides both path allocation and routing restrictions. We solve this problem with a genetic
algorithm and evaluate it using simulation experiments.
Heterogeneous Interconnect for Low-Power Snoop-Based Chip Multiprocessors (WEED 2010, ICCD 2010, JLPE 2012)
In this work we introduce Heterogeneous Interconnect for Low Resolution Cache Access (Helia).
Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary
activities in both interconnect and cache. This is achieved by using innovative snoop filtering
mechanisms coupled with wire management techniques. Our optimizations rely on the observation
that a high percentage of cache mismatches can be detected using a small but highly
informative subset of the tag bits. Helia relies on the snoop controller to detect possible remote
tag mismatches prior to tag array lookup. Power is reduced as a) our wire management techniques
permit slow transmission of a subset of tag bits while tag mismatches are being detected and b) we
avoid cache access for mismatches detected at the snoop controller.
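The partial tag comparison can be sketched as a filter placed in front of the tag array. This is a minimal illustration, assuming the informative subset is the low-order tag bits and that the snoop controller holds a copy of those bits for each way in the set; the bit width, the choice of bits, and the function names are hypothetical.

```python
PARTIAL_BITS = 4  # assumed width of the partial tag kept at the snoop controller


def partial_tag(tag, bits=PARTIAL_BITS):
    """Keep only the low-order bits of a tag (an assumed choice of subset)."""
    return tag & ((1 << bits) - 1)


def snoop_lookup(snoop_tag, set_tags, bits=PARTIAL_BITS):
    """Return (hit, full_lookup_performed) for one snoop request.

    If no way matches on the partial tag, the request is a guaranteed miss
    and the tag array is never accessed. A partial match can be a false
    positive, so it is confirmed by a full tag comparison.
    """
    wanted = partial_tag(snoop_tag, bits)
    if all(partial_tag(t, bits) != wanted for t in set_tags):
        return False, False            # filtered: no tag array access
    return snoop_tag in set_tags, True  # full comparison needed


set_tags = [0x1A3, 0x2B7, 0x0C5, 0x3F0]  # tags held in one cache set
print(snoop_lookup(0x5A9, set_tags))      # partial mismatch, lookup skipped
print(snoop_lookup(0x2B7, set_tags))      # true hit
print(snoop_lookup(0x4C5, set_tags))      # partial match but full mismatch
```

The filter never reports a false miss, so correctness is preserved; only the fraction of lookups it can skip depends on how informative the chosen bits are.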
Skills
Languages: C, C++, Shell Scripting.
System simulators: Simics, Booksim, SESC, MARSSx86, SimpleScalar.
CAD tools: Synopsys Design Compiler.
Professional Service
Reviewer: IEEE Transactions on Computers, 2012.
Reviewer: IEEE Computer Architecture Letters, 2015.
Reviewer: The Journal of Supercomputing, 2015.
Coursework
Graduate: Computer Architecture, Advanced Computer Architecture, Digital VLSI Design, Interconnection Networks, Fault-Tolerant Systems, Many-Core Parallel Programming, Algorithm Design.
References
Dr. Rajeev Balasubramonian
School of Computing
University of Utah
rajeev@cs.utah.edu
Dr. Erik Brunvand
School of Computing
University of Utah
elb@cs.utah.edu
Dr. Vivek Srikumar
School of Computing
University of Utah
svivek@cs.utah.edu