ICCAD 2014 Workshop: Heterogeneous Computing Platforms (HCP)
Thursday, November 6, 2014, Hilton San Jose, CA

Technical Program

Opening Remarks 8:15 – 8:30
Organizers: Gi-Joon Nam (IBM) and Mustafa Ozdal (Intel)

Session 1: Emerging Heterogeneous Architectures 8:30 – 10:30
Chair: Mustafa Ozdal, Intel
“ASPIRE: Specializing the Software and Hardware Stacks”
Krste Asanovic, University of California, Berkeley
“Acceleration-Rich Architectures – from Single-chip to Datacenters”
Jason Cong, University of California Los Angeles
“Architecting and Exploiting Asymmetry in Multi-Core Architectures”
Onur Mutlu, Carnegie Mellon University

Coffee Break 10:30 – 11:00

Session 2: High-Level Synthesis and Applications on Heterogeneous Platforms 11:00 – 12:00
Chair: Gi-Joon Nam, IBM Research
“FCUDA: the CUDA to FPGA Compiler”
Deming Chen, University of Illinois at Urbana-Champaign
“Towards Efficient Next-Generation Genome Sequencing”
Mishali Naik and Ganapati Srinivasa, Intel

Lunch 12:00 – 13:00

Session 3: Latest Heterogeneous Platforms and Programming Models 13:00 – 15:00
Chair: Danny Bathen, Intel
“Bringing Shared-Memory Reconfigurable Logic to the Datacenter and the Cloud”
Peter Hofstee, IBM Austin Research Lab
“Programming matters: Why Intel’s Many-core approach is so important”
James Reinders, Intel
“OpenCL on FPGAs: Custom Data Paths for Energy Efficient Computation”
Deshanand Singh, Altera

Coffee Break 15:00 – 15:30

Session 4: Applications on Heterogeneous Platforms II 15:30 – 17:00
Chair: Kees Vissers, Distinguished Engineer at Xilinx
“Key-Value Store Acceleration with OpenPower”
Michaela Blott, Xilinx
“Text-Analytics Acceleration on Power8 with CAPI”
Christoph Hagleitner, IBM Zurich Research Lab
“FPGA-Based Monte Carlo Simulation Acceleration on Power8”
Daniel Beece, IBM T. J.
Watson Research Center

Poster Session and Networking 17:00 – 17:45
"On-Chip Memory Design for OpenCL Heterogeneous Computing Platform"
Cheng-Chian Lin, Yi-Chiao Lin, Jun-Wei Lin, Bo-Yi Li, Ching-Lun Lin, Ing-Chao Lin, Da-Wei Chang and Alvin W.-Y. Su, National Cheng Kung University, Taiwan
"XLOOPS: Explicit Loop Specialization"
Shreesha Srinath, Berkin Illbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang and Christopher Batten, Cornell University, USA
"Heterogeneous Platforms for Big Data Applications"
Christian Brugger, Christian De Schryver and Norbert Wehn, University of Kaiserslautern, Germany
"Investigating the Opportunities for Statically Predicting and Dynamically Prefetching in CUDA/OPENCL"
Ahmad Lashgar and Amirali Baniasadi, University of Victoria, Canada
"Acceleration Experiences Using Heterogeneous Computing Platforms"
Alastair McKinley, Scott Fischaber, Paul Barber and Roger Woods, Analytics Engines Ltd., United Kingdom

PROGRAM DETAILS

ASPIRE: Specializing the Software and Hardware Stacks
Krste Asanovic received a B.A. in Electrical and Information Sciences from Cambridge University in 1987 and a Ph.D. in Computer Science from U.C. Berkeley in 1998. He was an Assistant and Associate Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, Cambridge, from 1998 to 2007. In 2007, he joined the faculty at the University of California, Berkeley, where he co-founded the Berkeley Parallel Computing Laboratory. He is currently a Professor of Electrical Engineering and Computer Sciences and Director of the Berkeley ASPIRE Laboratory, which is developing new techniques to increase computing efficiency above the transistor level. He is an IEEE Fellow and an ACM Distinguished Scientist.

Acceleration-Rich Architectures - from Single-chip to Datacenters
Jason Cong received his B.S. degree in computer science from Peking University in 1985, his M.S. and Ph.D.
degrees in computer science from the University of Illinois at Urbana-Champaign in 1987 and 1990, respectively. Currently, he is a Chancellor’s Professor at the Computer Science Department of the University of California, Los Angeles, the director of the Center for Domain-Specific Computing (CDSC), co-director of the UCLA/Peking University Joint Research Institute in Science and Engineering, and co-director of the VLSI CAD Laboratory. He served as the chair of the UCLA Computer Science Department from 2005 to 2008. Dr. Cong’s research interests include synthesis of VLSI circuits and systems, energy-efficient computer architectures, reconfigurable systems, nanotechnology and systems, and highly scalable algorithms. He has over 400 publications in these areas and has received 10 best paper awards and the 2011 ACM/IEEE A. Richard Newton Technical Impact Award in Electronic Design Automation. He was elected an IEEE Fellow in 2000 and an ACM Fellow in 2008. He is the recipient of the 2010 IEEE Circuits and Systems (CAS) Society Technical Achievement Award "For seminal contributions to electronic design automation, especially in FPGA synthesis, VLSI interconnect optimization, and physical design automation." Dr. Cong has graduated 31 PhD students. Nine of them are now faculty members in major research universities, including Cornell, Fudan University, Georgia Tech, Peking University, Purdue, SUNY Binghamton, UCLA, UIUC, and UT Austin. Dr. Cong has successfully founded or co-founded three companies with his students for technology transfer, including Aplus Design Technologies for FPGA physical synthesis and architecture evaluation (acquired by Magma in 2003, now part of Synopsys), AutoESL Design Technologies for high-level synthesis (acquired by Xilinx in 2011), and Neptune Design Automation for ultra-fast FPGA physical design (acquired by Xilinx in 2013). Dr. Cong is also a distinguished visiting professor at Peking University.

Architecting and Exploiting Asymmetry in Multi-Core Architectures
Onur Mutlu is the Strecker Early Career Professor at Carnegie Mellon University. His broader research interests are in computer architecture and systems, especially in the interactions between languages, operating systems, compilers, and microarchitecture. He enjoys teaching and researching problems in computer architecture, including those related to the design of memory/storage systems, multi-core architectures, and scalable and efficient systems. He obtained his PhD and MS in ECE from the University of Texas at Austin (2006) and BS degrees in Computer Engineering and Psychology from the University of Michigan, Ann Arbor. Prior to Carnegie Mellon, he worked at Microsoft Research (2006-2009), Intel Corporation, and Advanced Micro Devices. He was a recipient of the IEEE Computer Society Young Computer Architect Award, the Intel Early Career Faculty Honor Award, Faculty Partnership Awards from IBM, HP, and Microsoft, four best paper awards, and a number of "computer architecture top pick" paper selections by the IEEE Micro magazine. For more information, please see his webpage at http://www.ece.cmu.edu/~omutlu

FCUDA: the CUDA to FPGA Compiler
Deming Chen obtained his BS in computer science from the University of Pittsburgh, Pennsylvania in 1995, and his MS and PhD in computer science from the University of California at Los Angeles in 2001 and 2005, respectively. He worked as a software engineer from 1995 to 1999 and from 2001 to 2002. He has been an associate professor in the ECE department of the University of Illinois, Urbana-Champaign since 2011. He is a research associate professor in the Coordinated Science Laboratory and an affiliate associate professor in the CS department.
His current research interests include system-level and high-level synthesis, nanosystems design and nano-centric CAD techniques, GPU optimization, reconfigurable computing, hardware/software co-design, hardware security, and computational biology. Dr. Chen is a technical committee member for a series of conferences and symposia, including FPGA, ASPDAC, ICCD, ISQED, DAC, ICCAD, DATE, ISLPED, and FPL. He has also served as session chair, panelist, panel organizer, or moderator for these and other conferences. He is the TPC Subcommittee or Track Chair for ASPDAC'09-11 and '13, ISVLSI'09, ISCAS'10-11, VLSI-SoC'11, ICCAD'12, ICECS'12, ISLPED'14, and ICCD'14. He is the General Chair for SLIP'12, the CANDE Workshop Chair in 2011, the Program Chair for PROFIT'12, and Program Chair for FPGA'15. He is or has been an associate editor for TCAD (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems), TODAES (ACM Transactions on Design Automation of Electronic Systems), TVLSI (IEEE Transactions on Very Large Scale Integration Systems), TCAS-I (IEEE Transactions on Circuits and Systems I), JCSC (Journal of Circuits, Systems and Computers), and JOLPE (Journal of Low Power Electronics). He obtained the Achievement Award for Excellent Teamwork from Aplus Design Technologies in 2001, the Arnold O. Beckman Research Award from UIUC in 2007, the NSF CAREER Award in 2008, and five Best Paper Awards for ASPDAC'09, SASP'09, FCCM'11, SAAHPC'11, and CODES+ISSS'13. He was included in the List of Teachers Ranked as Excellent in 2008. He received the ACM SIGDA Outstanding New Faculty Award in 2010 and an IBM Faculty Award in 2014. He is a senior member of IEEE. Dr. Chen was involved in two startup companies. He implemented his published algorithm on CPLD technology mapping when he was a software engineer at Aplus Design Technologies, Inc. in 2001, and the software was exclusively licensed by Altera and distributed to many customers of Altera worldwide.
He is one of the inventors of the xPilot High Level Synthesis package developed at UCLA, which was licensed to AutoESL Design Technologies, Inc. Aplus was acquired by Magma in 2003, and AutoESL was acquired by Xilinx in 2011.

Towards Efficient Next-Generation Genome Sequencing
Mishali Naik is a senior technology architect in the Data Center Group at Intel Corporation. She is currently investigating the demands of emerging Big Data workloads, specifically Genomics, to drive future cluster architecture. She is collaborating with leading institutions, including Broad, OHSU, and UCLA, to understand Genomics workloads, optimize SW for current Intel architecture, and investigate energy-efficient techniques for future clusters. She received her Ph.D. degree in Computer Science from the University of California, Los Angeles.

Gans Srinivasa is a senior principal engineer with the Data Center Group at Intel Corporation. He managed the architecture definition of several generations of the Intel Xeon server product line. Currently, he is leading research activities on heterogeneous computing for energy and power efficiency, focused on Genomics and Genomics-driven Big Data. He joined Intel in 1993, where he led the efforts on VLSI CAD and created the Area Routing concepts for Intel’s microprocessor design; he created Cloud computing in 2000 and has led multi-core Xeon architecture for the past 12 years. He currently spends most of his days at OHSU, splitting his time collaborating with our partners in the Genomics area.

Bringing Shared-Memory Reconfigurable Logic to the Datacenter and the Cloud
Abstract: The addition of shared-memory reconfigurable logic into mainstream servers opens up new possibilities for system optimization. In this talk we explain why it is important to pursue a shared-memory architecture and we discuss what reconfigurable logic can be effectively used for. We also discuss how these ideas have been realized in the OpenPOWER platform.

Dr.
Peter Hofstee currently works at the IBM Austin Research Laboratory on workload-optimized and hybrid systems. Peter has degrees in theoretical physics (MS, Rijks Universiteit Groningen, Netherlands) and computer science (PhD, California Institute of Technology). At IBM Peter has worked on microprocessors, including the first CMOS processor to demonstrate GHz operation (1997), and he was the chief architect of the synergistic processor elements in the Cell Broadband Engine, known from its use in the Sony PlayStation 3 and the Roadrunner supercomputer that first broke the 1 Petaflop Linpack benchmark. His interests include VLSI, multicore and heterogeneous microprocessor architecture, security, system design and programming. Peter has over 100 patents issued or pending.

Programming matters: Why Intel’s Many-core approach is so important
Abstract: James will provide an overview of the Intel Xeon Phi coprocessor with a focus on why programmability is the key distinction. James will explain the vision behind Intel’s approach in terms of programmability. Motivation will be drawn from the challenges in front of parallel programmers and the resulting importance of scaling, dynamic load-balancing, explicit vectorization, nested parallelism and composability. James will share examples from customers, including some covered in an upcoming book.

James Reinders is a Parallel Programming Evangelist at Intel. James is involved in multiple engineering, research and educational efforts to increase the use of parallel programming throughout the industry. He joined Intel Corporation in 1989, and has contributed to numerous projects including the world's first TeraFLOP/s supercomputer (ASCI Red) and the world's first TeraFLOP/s microprocessor (Intel® Xeon Phi™ coprocessor).
James has been an author on numerous technical books, including VTune™ Performance Analyzer Essentials (Intel Press, 2005), Intel® Threading Building Blocks (O'Reilly Media, 2007), Structured Parallel Programming (Morgan Kaufmann, 2012), Intel® Xeon Phi™ Coprocessor High Performance Programming (Morgan Kaufmann, 2013), and Multithreading for Visual Effects (A K Peters/CRC Press, 2014). James is co-editing a book of multicore/many-core programming examples scheduled to be published in late 2014.

OpenCL on FPGAs: Custom Data Paths for Energy Efficient Computation
Abstract: In this talk we explore two major developments in FPGA architecture and design description, which together transform the landscape to enable FPGAs as an attractive accelerator technology in a wider range of conventional and embedded applications. Specifically, we describe the OpenCL high level design flow offered by Altera which provides a full system solution, eliminating the challenges of external world interfacing and timing closure, and bringing FPGAs closer to the development experience offered by other accelerator technologies. We further explore new technologies incorporated into the OpenCL solution such as inter-kernel channels and task-based design description, which enable natural high level expression of highly pipelined algorithms. Finally, we describe the capabilities of Altera’s new floating point DSP blocks, which eliminate the resource penalty of floating point implementations on an FPGA. Through the use of several case studies, we will demonstrate that the use of an OpenCL programming model in conjunction with floating point DSP blocks can lead to algorithm implementations that are both high performance and low power.

Desh Singh is a Director of Software Engineering at Altera's Toronto Technology Center. Desh’s mandate is to develop high level design tools that allow designers to create applications for FPGAs with a higher level of productivity than traditionally possible.
His responsibilities include OpenCL, High Level Synthesis, DSP Builder, Interconnect Fabrics and Altera’s University Program. Desh holds a PhD from the University of Toronto in the area of timing closure techniques for high speed FPGA designs and has authored over 60 patents and publications on FPGA technology.

Key-Value Store Acceleration with OpenPower
Abstract: Distributed key-value stores such as memcached form a critical middleware application within today’s web infrastructure. However, typical multicore multithreaded systems yield limited performance scalability and high power consumption, as their architecture, with its optimization for single-thread performance, is not well matched to the memory-intensive and parallel nature of this application. We present an architecture and implementation of an accelerated key-value store appliance that delivers a measured 36x improvement in performance/power at response times in the microsecond range. Through the coherent integration of memory through IBM’s OpenPower architecture, we can provide for economic scaling of value store density to terabytes, utilizing host memory and CAPI-attached flash as value store. Furthermore, the inherent coherency streamlines the partitioning of functionality between Power8 and accelerators, reduces code and ultimately simplifies the process by which we can add functionality such as data analytics to the platform.

Michaela Blott graduated from the University of Kaiserslautern in Germany. She has worked in both research institutions (ETH and Bell Labs) and development organizations and was deeply involved in large scale international collaborations such as NetFPGA-10G. Her expertise spans high-speed networking, emerging memory technologies, data centers and distributed computing systems with a focus on FPGA-based implementations. Today, she works as a senior research scientist at the Xilinx labs in Dublin, heading a team of international researchers.
Her key responsibility is exploring applications, system architectures and new design flows for FPGAs in data centers.

Text-Analytics Acceleration on Power8 with CAPI
Abstract: The amount of text data has reached a new scale and continues to grow at an unprecedented rate. IBM’s SystemT software is a powerful text analytics system, which offers a query-based interface to reveal the valuable information that lies within these mounds of data. Traditional server architectures are not capable of analyzing the so-called "Big Data" in an efficient way, despite the high memory bandwidth that is available. We show that by using a hardware accelerator implemented in reconfigurable logic, the throughput rates of SystemT’s information extraction queries can be improved by an order of magnitude. We demonstrate how such a system can be deployed by extending SystemT’s existing compilation flow and by using the cache-coherent accelerator interface on IBM’s latest Power8 processor.

Christoph Hagleitner obtained a diploma degree and a Ph.D. degree in Electrical Engineering from the Swiss Federal Institute of Technology (ETH), Zurich in 1997 and 2002, respectively. During his Ph.D. work, Christoph Hagleitner specialized in interface circuitry and system aspects of CMOS integrated micro- and nanosystems. In 2003 he joined IBM Research - Zürich in Rüschlikon, Switzerland, where he worked on the system architecture and mixed-signal design of a novel probe-storage device. Since 2008, he has managed the Accelerator technologies group. The group's main research direction is performance- and energy-optimized HW accelerators for next-generation computing systems. He has authored and co-authored several book chapters and 70+ papers in journals and conference proceedings.

FPGA-Based Monte Carlo Simulation Acceleration on Power8
Abstract: We describe an FPGA implementation of a financial Monte Carlo pricing function for an IBM Power8.
The FPGA is an Altera Stratix V on a Coherent Accelerator Processor Interface (CAPI) card, a PCIe card available for Power8. The FPGA computes the cash flows for the accumulator forward option, for which an analytic solution does not exist. We will provide an overview of the Power8 CAPI architecture and give details of the FPGA design, which includes a Sobol quasi-random number generator, an accurate inverse normal function, stock price paths generated using a Black-Scholes model of Brownian motion, the final cash flow/payout computation, and the logic to support communication with the Power8 host. Our implementation of the pricing function is at least fifty times faster than an equivalent software implementation running on the Power8 host, and thus offers a fast, low cost accelerator for portfolio risk analysis.

Daniel Beece received the Bachelor of Science in Engineering Physics at Cornell University, Ithaca, New York, and the Master's and Ph.D. in Physics from the University of Illinois in Urbana-Champaign. He has been a Research Staff Member at the T.J. Watson Research Center at IBM since 1982. He has worked in the areas of circuit timing and optimization, logic and function VLSI system simulation, hardware accelerators for VLSI, highly parallel systems and special purpose architectures. He currently works in the area of VLSI systems, design automation and application acceleration.
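
As a rough illustration of the pipeline named in the Monte Carlo abstract above (quasi-random numbers, inverse normal transform, Black-Scholes price path, discounted payout), the following Python sketch prices an option in software. It is not the speakers' FPGA design. The accumulator forward payout is replaced by a plain European call, the Sobol generator by a one-dimensional base-2 van der Corput sequence (which coincides with the first Sobol dimension), and the price path by a single time step. All function names and parameter values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def van_der_corput(i: int) -> float:
    """Base-2 radical inverse of i; lies strictly inside (0, 1) for i >= 1."""
    value, scale = 0.0, 0.5
    while i > 0:
        value += scale * (i & 1)  # mirror the binary digits around the point
        i >>= 1
        scale *= 0.5
    return value


def qmc_call_price(s0, strike, rate, sigma, maturity, n_points=4096):
    """Quasi-Monte Carlo price of a European call (stand-in payout)."""
    std_normal = NormalDist()
    drift = (rate - 0.5 * sigma ** 2) * maturity
    vol = sigma * math.sqrt(maturity)
    total = 0.0
    for i in range(1, n_points + 1):
        # Quasi-random uniform -> standard normal via the inverse CDF.
        z = std_normal.inv_cdf(van_der_corput(i))
        # Terminal price under geometric Brownian motion (one time step).
        terminal = s0 * math.exp(drift + vol * z)
        total += max(terminal - strike, 0.0)
    # Discount the average payout back to today.
    return math.exp(-rate * maturity) * total / n_points


def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form Black-Scholes call price, used only as a reference."""
    cdf = NormalDist().cdf
    vol = sigma * math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / vol
    d2 = d1 - vol
    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


if __name__ == "__main__":
    args = (100.0, 100.0, 0.05, 0.2, 1.0)  # illustrative parameters
    print("quasi-Monte Carlo:", qmc_call_price(*args))
    print("closed form:      ", black_scholes_call(*args))
```

A closed-form price exists for the European call, so `black_scholes_call` gives a reference for checking the simulated estimate. The abstract notes that no analytic solution exists for the accumulator forward option, which is why that product has to be priced by simulation.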