HCP'14 Program Details

ICCAD 2014 Workshop: Heterogeneous Computing Platforms (HCP)
Thursday, November 6, 2014, Hilton San Jose, CA
Technical Program
Opening Remarks
8:15 – 8:30
Organizers: Gi-Joon Nam (IBM) and Mustafa Ozdal (Intel)
Session 1: Emerging Heterogeneous Architectures
8:30 – 10:30
Chair: Mustafa Ozdal, Intel
"ASPIRE: Specializing the Software and Hardware Stacks”
Krste Asanovic, University of California, Berkeley
Speaker Bio
“Acceleration-Rich Architectures – from Single-chip to Datacenters”
Jason Cong, University of California Los Angeles
Speaker Bio
“Architecting and Exploiting Asymmetry in Multi-Core Architectures”
Onur Mutlu, Carnegie Mellon University
Speaker Bio
Coffee Break
10:30 – 11:00
Session 2: High-Level Synthesis and Applications on Heterogeneous Platforms
11:00 – 12:00
Chair: Gi-Joon Nam, IBM Research
“FCUDA: the CUDA to FPGA Compiler”
Deming Chen, University of Illinois at Urbana-Champaign
Speaker Bio
“Towards Efficient Next-Generation Genome Sequencing”
Mishali Naik and Ganapati Srinivasa, Intel
Speaker Bios
Lunch
12:00 – 13:00
Session 3: Latest Heterogeneous Platforms and Programming Models
13:00 – 15:00
Chair: Danny Bathen, Intel
“Bringing Shared-Memory Reconfigurable Logic to the Datacenter and the Cloud”
Peter Hofstee, IBM Austin Research Lab
Abstract and Speaker Bio
“Programming matters: Why Intel’s Many-core approach is so important”
James Reinders, Intel
Abstract and Speaker Bio
“OpenCL on FPGAs: Custom Data Paths for Energy Efficient Computation”
Deshanand Singh, Altera
Abstract and Speaker Bio
Coffee Break
15:00 – 15:30
Session 4: Applications on Heterogeneous Platforms II
15:30 – 17:00
Chair: Kees Vissers, Distinguished Engineer at Xilinx
“Key-Value Store Acceleration with OpenPOWER”
Michaela Blott, Xilinx
Abstract and Speaker Bio
“Text-Analytics Acceleration on Power8 with CAPI”
Christoph Hagleitner, IBM Zurich Research Lab
Abstract and Speaker Bio
“FPGA-Based Monte Carlo Simulation Acceleration on Power8”
Daniel Beece, IBM T. J. Watson Research Center
Abstract and Speaker Bio
Poster Session and Networking
17:00 – 17:45
"On-Chip Memory Design for OpenCL Heterogeneous Computing Platform"
Cheng-Chian Lin, Yi-Chiao Lin, Jun-Wei Lin, Bo-Yi Li, Ching-Lun Lin, Ing-Chao Lin, Da-Wei Chang and Alvin
W.-Y. Su, National Cheng Kung University, Taiwan
"XLOOPS: Explicit Loop Specialization"
Shreesha Srinath, Berkin Illbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang and Christopher Batten, Cornell
University, USA
"Heterogeneous Platforms for Big Data Applications"
Christian Brugger, Christian De Schryver and Norbert Wehn, University of Kaiserslautern, Germany
"Investigating the Opportunities for Statically Predicting and Dynamically Prefetching in CUDA/OPENCL"
Ahmad Lashgar and Amirali Baniasadi, University of Victoria, Canada
"Acceleration Experiences Using Heterogeneous Computing Platforms"
Alastair McKinley, Scott Fischaber, Paul Barber and Roger Woods, Analytics Engines Ltd., United Kingdom
PROGRAM DETAILS
ASPIRE: Specializing the Software and Hardware Stacks
Krste Asanovic received a B.A. in Electrical and Information Sciences from Cambridge University in 1987
and a Ph.D. in Computer Science from U.C. Berkeley in 1998. He was an Assistant and Associate
Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology,
Cambridge, from 1998 to 2007. In 2007, he joined the faculty at the University of California, Berkeley,
where he co-founded the Berkeley Parallel Computing Laboratory. He is currently a Professor of
Electrical Engineering and Computer Sciences and Director of the Berkeley ASPIRE Laboratory, which is
developing new techniques to increase computing efficiency above the transistor level. He is an IEEE
Fellow and an ACM Distinguished Scientist.
Acceleration-Rich Architectures - from Single-chip to Datacenters
Jason Cong received his B.S. degree in computer science from Peking University in 1985, his M.S. and
Ph.D. degrees in computer science from the University of Illinois at Urbana-Champaign in 1987 and
1990, respectively. Currently, he is a Chancellor’s Professor at the Computer Science Department of the
University of California, Los Angeles, the director of the Center for Domain-Specific Computing (CDSC), co-director
of the UCLA/Peking University Joint Research Institute in Science and Engineering, and co-director
of the VLSI CAD Laboratory. He served as chair of the UCLA Computer Science Department from 2005
to 2008. Dr. Cong’s research interests include synthesis of VLSI circuits and systems, energy-efficient
computer architectures, reconfigurable systems, nanotechnology and systems, and highly scalable
algorithms. He has over 400 publications in these areas, including 10 best paper awards, and the 2011
ACM/IEEE A. Richard Newton Technical Impact Award in Electronic Design Automation. He was elected
an IEEE Fellow in 2000 and an ACM Fellow in 2008. He is the recipient of the 2010 IEEE Circuits and Systems
(CAS) Society Technical Achievement Award "For seminal contributions to electronic design automation,
especially in FPGA synthesis, VLSI interconnect optimization, and physical design automation."
Dr. Cong has graduated 31 PhD students. Nine of them are now faculty members in major research
universities, including Cornell, Fudan University, Georgia Tech., Peking University, Purdue, SUNY
Binghamton, UCLA, UIUC, and UT Austin. Dr. Cong has successfully founded/co-founded three
companies with his students for technology transfer, including Aplus Design Technologies for FPGA
physical synthesis and architecture evaluation (acquired by Magma in 2003, now part of Synopsys),
AutoESL Design Technologies for high-level synthesis (acquired by Xilinx in 2011), and Neptune Design
Automation for ultra-fast FPGA physical design (acquired by Xilinx in 2013). Dr. Cong is also a
distinguished visiting professor at Peking University.
Architecting and Exploiting Asymmetry in Multi-Core Architectures
Onur Mutlu is the Strecker Early Career Professor at Carnegie Mellon University. His broader research
interests are in computer architecture and systems, especially in the interactions between languages,
operating systems, compilers, and microarchitecture. He enjoys teaching and researching problems in
computer architecture, including those related to the design of memory/storage systems, multi-core
architectures, and scalable and efficient systems. He obtained his PhD and MS in ECE from the University
of Texas at Austin (2006) and BS degrees in Computer Engineering and Psychology from the University of
Michigan, Ann Arbor. Prior to Carnegie Mellon, he worked at Microsoft Research (2006-2009), Intel
Corporation, and Advanced Micro Devices. He was a recipient of the IEEE Computer Society Young
Computer Architect Award, Intel Early Career Faculty Honor Award, Faculty Partnership Awards from
IBM, HP, and Microsoft, four best paper awards, and a number of "computer architecture top pick"
paper selections by the IEEE Micro magazine. For more information, please see his webpage at
http://www.ece.cmu.edu/~omutlu
FCUDA: the CUDA to FPGA Compiler
Deming Chen obtained his BS in computer science from the University of Pittsburgh, Pennsylvania, in 1995,
and his MS and PhD in computer science from the University of California, Los Angeles, in 2001 and 2005,
respectively. He worked as a software engineer from 1995 to 1999 and from 2001 to 2002. He has been an
associate professor in the ECE department of the University of Illinois at Urbana-Champaign since 2011. He is
a research associate professor in the Coordinated Science Laboratory and an affiliate associate professor
in the CS department. His current research interests include system-level and high-level synthesis, nanosystems design and nano-centric CAD techniques, GPU optimization, reconfigurable computing,
hardware/software co-design, hardware security, and computational biology.
Dr. Chen is a technical committee member for a series of conferences and symposia, including FPGA,
ASPDAC, ICCD, ISQED, DAC, ICCAD, DATE, ISLPED, FPL, etc. He also served as session chair, panelist,
panel organizer, or moderator for these and other conferences. He is the TPC Subcommittee or Track
Chair for ASPDAC'09-11 and '13, ISVLSI'09, ISCAS'10-11, VLSI-SoC'11, ICCAD'12, ICECS'12, ISLPED'14, and
ICCD'14. He is the General Chair for SLIP'12, the CANDE Workshop Chair in 2011, the Program Chair for
PROFIT'12, and Program Chair for FPGA'15. He is or has been an associate editor for TCAD (IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems), TODAES (ACM Transactions
on Design Automation of Electronic Systems), TVLSI (IEEE Transactions on Very Large Scale Integration
Systems), TCAS-I (IEEE Transactions on Circuits and Systems I), JCSC (Journal of Circuits, Systems and
Computers), and JOLPE (Journal of Low Power Electronics). He obtained the Achievement Award for
Excellent Teamwork from Aplus Design Technologies in 2001, the Arnold O. Beckman Research Award
from UIUC in 2007, the NSF CAREER Award in 2008, and five Best Paper Awards for ASPDAC'09, SASP'09,
FCCM'11, SAAHPC'11, and CODES+ISSS'13. He is included in the List of Teachers Ranked as Excellent in
2008. He received the ACM SIGDA Outstanding New Faculty Award in 2010 and IBM Faculty Award in
2014. He is a senior member of IEEE.
Dr. Chen was involved in two startup companies. He implemented his published algorithm on CPLD
technology mapping when he was a software engineer at Aplus Design Technologies, Inc. in 2001, and
the software was exclusively licensed to Altera and distributed to many of Altera's customers worldwide.
He is one of the inventors of the xPilot High Level Synthesis package developed at UCLA, which was
licensed to AutoESL Design Technologies, Inc. Aplus was acquired by Magma in 2003, and AutoESL was
acquired by Xilinx in 2011.
Towards Efficient Next-Generation Genome Sequencing
Mishali Naik is a senior technology architect in the Data Center Group at Intel Corporation. She is currently
investigating the demands of emerging Big Data workloads, specifically genomics, to drive future cluster
architectures. She is collaborating with leading institutions, including the Broad Institute, OHSU, and UCLA, to understand
genomics workloads, optimize software for current Intel architectures, and investigate energy-efficient
techniques for future clusters. She received her Ph.D. degree in Computer Science from the University of
California, Los Angeles.
Gans Srinivasa is a senior principal engineer with the Data Center Group at Intel Corporation. He
managed the architecture definition of several generations of the Intel Xeon server product line.
Currently, he is leading research activities on heterogeneous computing for energy and power efficiency
focused on genomics and genomics-driven Big Data. In 1993, he joined Intel, where he led the efforts on
VLSI CAD and created the Area Routing concepts for Intel’s microprocessor design; he moved into cloud
computing in 2000 and has led multi-core Xeon architecture for the past 12 years. He currently spends
most of his time working at OHSU, collaborating with partners in the genomics area.
Bringing Shared-Memory Reconfigurable Logic to the Datacenter and the Cloud
Abstract: The addition of shared-memory reconfigurable logic into mainstream servers opens up new
possibilities for system optimization. In this talk we explain why it is important to pursue a shared-memory architecture and discuss what reconfigurable logic can be used for effectively. We also
discuss how these ideas have been realized in the OpenPOWER platform.
Dr. Peter Hofstee currently works at the IBM Austin Research Laboratory on workload-optimized and
hybrid systems. Peter has degrees in theoretical physics (MS, Rijks Universiteit Groningen, Netherlands)
and computer science (PhD, California Inst. of Technology). At IBM Peter has worked on
microprocessors, including the first CMOS processor to demonstrate GHz operation (1997), and he was
the chief architect of the synergistic processor elements in the Cell Broadband Engine, known from its
use in the Sony PlayStation 3 and the Roadrunner supercomputer, the first to break the 1-petaflop barrier on the Linpack
benchmark. His interests include VLSI, multicore and heterogeneous microprocessor architecture,
security, system design and programming. Peter has over 100 patents issued or pending.
Programming matters: Why Intel’s Many-core approach is so important
Abstract: James will provide an overview of the Intel Xeon Phi coprocessor with a focus on why
programmability is the key distinction. James will explain the vision behind Intel’s approach in terms of
programmability. Motivation will be drawn from the challenges in front of parallel programmers and the
resulting importance of scaling, dynamic load-balancing, explicit vectorization, nested parallelism, and
composability. James will share examples from customers, including some covered in an upcoming
book.
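The programming themes James highlights (scaling, dynamic load balancing, explicit vectorization, nested parallelism) are concepts expressed directly in source code. The short C/OpenMP sketch below is not taken from the talk; it only illustrates how two of them, nested parallelism and explicit vectorization with dynamic scheduling, might look in practice. The array sizes and names are purely illustrative, and OpenMP is just one of several models Intel supports for the coprocessor.

    /* Minimal sketch (not from the talk): nested parallelism plus explicit
       vectorization in C with OpenMP 4.0, as one might write for a many-core
       target such as the Intel Xeon Phi coprocessor. Names are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Scale a block of a row; the loop is explicitly vectorized with OpenMP simd. */
    static void scale_block(float *block, int n, float factor)
    {
        #pragma omp simd
        for (int j = 0; j < n; ++j)
            block[j] *= factor;
    }

    int main(void)
    {
        enum { ROWS = 1024, COLS = 4096, BLOCK = 512 };
        float *a = malloc(sizeof(float) * ROWS * COLS);
        for (long i = 0; i < (long)ROWS * COLS; ++i)
            a[i] = 1.0f;

        omp_set_nested(1);  /* allow nested parallel regions */

        /* Outer level: rows are distributed across threads with dynamic
           scheduling, so load imbalance between rows is absorbed. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < ROWS; ++i) {
            /* Inner (nested) level: each thread parallelizes its own row. */
            #pragma omp parallel for num_threads(2)
            for (int jb = 0; jb < COLS; jb += BLOCK)
                scale_block(&a[(long)i * COLS + jb], BLOCK, 2.0f);
        }

        printf("a[0] = %.1f (expected 2.0)\n", a[0]);
        free(a);
        return 0;
    }

With any OpenMP-capable compiler this builds as, for example, gcc -fopenmp example.c; the same source style carries over to a many-core target without structural changes, which is the programmability point the talk emphasizes.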
James Reinders is a Parallel Programming Evangelist at Intel. James is involved in multiple engineering,
research and educational efforts to increase use of parallel programming throughout the industry. He
joined Intel Corporation in 1989, and has contributed to numerous projects including the world's first
TeraFLOP/s supercomputer (ASCI Red) and the world's first TeraFLOP/s microprocessor (Intel® Xeon
Phi™ coprocessor). James has been an author of numerous technical books, including VTune™ Performance
Analyzer Essentials (Intel Press, 2005), Intel® Threading Building Blocks (O'Reilly Media, 2007),
Structured Parallel Programming (Morgan Kaufmann, 2012), Intel® Xeon Phi™ Coprocessor High
Performance Programming (Morgan Kaufmann, 2013), and Multithreading for Visual Effects (A K
Peters/CRC Press, 2014). James is co-editing a book of multicore/many-core programming examples
scheduled to be published in late 2014.
OpenCL on FPGAs: Custom Data Paths for Energy Efficient Computation
Abstract: In this talk we explore two major developments in FPGA architecture and design description,
which together transform the landscape to enable FPGAs as an attractive accelerator technology in a
wider range of conventional and embedded applications. Specifically, we describe the OpenCL high-level
design flow offered by Altera, which provides a full system solution, eliminating the challenges of
external world interfacing and timing closure, and bringing FPGAs closer to the development experience
offered by other accelerator technologies. We further explore new technologies incorporated into the
OpenCL solution such as inter-kernel channels and task-based design description, which enable natural
high level expression of highly pipelined algorithms. Finally, we describe the capabilities of Altera’s new
floating point DSP blocks, which eliminate the resource penalty of floating point implementations on an
FPGA. Through the use of several case studies, we will demonstrate that the use of an OpenCL
programming model in conjunction with floating point DSP blocks can lead to algorithm
implementations that are both high performance and low power.
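The inter-kernel channels and task-based description mentioned in the abstract naturally express producer/consumer pipelines. Below is a minimal kernel-side sketch, not taken from the talk, written in OpenCL C (a C dialect) and assuming Altera's channels extension (cl_altera_channels with read_channel_altera/write_channel_altera as documented at the time); the kernel names and the arithmetic are placeholders, and the host-side setup is omitted.

    // Minimal sketch (not from the talk): two single work-item kernels
    // connected by an Altera inter-kernel channel, forming a small pipeline.
    // Assumes the cl_altera_channels extension; names are illustrative only.
    #pragma OPENCL EXTENSION cl_altera_channels : enable

    channel float stage_link __attribute__((depth(64)));

    // Producer: stream input data into the channel.
    __kernel void producer(__global const float *restrict in, const int n)
    {
        for (int i = 0; i < n; ++i)
            write_channel_altera(stage_link, in[i]);
    }

    // Consumer: read from the channel, apply a custom datapath, write results.
    __kernel void consumer(__global float *restrict out, const int n)
    {
        for (int i = 0; i < n; ++i) {
            float x = read_channel_altera(stage_link);
            out[i] = x * x + 1.0f;   // placeholder computation
        }
    }

On the host, each kernel would typically be launched as a task (single work item) with clEnqueueTask, letting the compiler pipeline the loops and keep intermediate data on chip between stages rather than round-tripping through external memory.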
Desh Singh is a Director of Software Engineering at Altera's Toronto Technology Center. Desh’s mandate
is to develop high level design tools that allow designers to create applications for FPGAs with a higher
level of productivity than traditionally possible. His responsibilities include OpenCL, High Level Synthesis,
DSP Builder, Interconnect Fabrics and Altera’s University Program. Desh holds a PhD from the University
of Toronto in the area of timing closure techniques for high speed FPGA designs and has authored over
60 patents and publications on FPGA technology.
Key-Value Store Acceleration with OpenPOWER
Abstract: Distributed key-value stores such as memcached form a critical middleware application within
today’s web infrastructure. However, typical multicore multithreaded systems yield limited performance
scalability and high power consumption, as their architecture, optimized for single-thread
performance, is not well matched to the memory-intensive and parallel nature of this application.
We present an architecture and implementation of an accelerated key-value store appliance that delivers
a measured 36x improvement in performance/power at response times in the microsecond range. Through the
coherent integration of memory enabled by IBM’s OpenPOWER architecture, we can scale value-store
density economically to terabytes, utilizing host memory and CAPI-attached flash as the value store.
Furthermore, the inherent coherency streamlines the partitioning of functionality between Power8 and
accelerators, reduces code, and ultimately simplifies the process by which we can add functionality such
as data analytics to the platform.
Michaela Blott graduated from the University of Kaiserslautern in Germany. She has worked in both
research institutions (ETH and Bell Labs) and development organizations, and was deeply involved
in large-scale international collaborations such as NetFPGA-10G. Her expertise spans high-speed
networking, emerging memory technologies, data centers and distributed computing systems with a
focus on FPGA-based implementations. Today, she works as a senior research scientist at the Xilinx labs
in Dublin heading a team of international researchers. Her key responsibility is exploring applications,
system architectures and new design flows for FPGAs in data centers.
Text-Analytics Acceleration on Power8 with CAPI
Abstract: The amount of text data has reached a new scale and continues to grow at an unprecedented
rate. IBM’s SystemT software is a powerful text analytics system, which offers a query-based interface to
reveal the valuable information that lies within these mounds of data. Traditional server architectures
are not capable of analyzing this so-called "Big Data" efficiently, despite the high memory
bandwidth that is available. We show that by using a hardware accelerator implemented in
reconfigurable logic, the throughput rates of SystemT’s information extraction queries can be
improved by an order of magnitude. We demonstrate how such a system can be deployed by extending
SystemT’s existing compilation flow and by using the cache-coherent accelerator interface on IBM’s
latest Power8 processor.
Christoph Hagleitner obtained a diploma degree and a Ph.D. degree in Electrical Engineering from the
Swiss Federal Institute of Technology (ETH), Zurich in 1997 and 2002, respectively. During his Ph.D.
work, Christoph Hagleitner specialized in interface circuitry and system aspects of CMOS integrated
micro- and nanosystems. In 2003 he joined IBM Research - Zürich in Rüschlikon, Switzerland where he
worked on the system architecture and mixed-signal design of a novel probe-storage device. Since 2008,
he has managed the Accelerator Technologies group. The group’s main research direction is
performance- and energy-optimized hardware accelerators for next-generation computing systems. He has
authored and co-authored several book chapters and 70+ papers in journals and conference
proceedings.
FPGA-Based Monte Carlo Simulation Acceleration on Power8
Abstract: We describe an FPGA implementation of a financial Monte Carlo pricing function for an IBM
Power8. The FPGA is an Altera Stratix V on a Coherent Accelerator Processor Interface (CAPI) card, a PCIe
card available for Power8. The FPGA computes the cash flows for the accumulator forward option, for
which an analytic solution does not exist. We will provide an overview of the Power8 CAPI architecture
and give details of the FPGA design, which includes a Sobol quasi-random number generator, an
accurate inverse normal function, stock price paths generated using a Black-Scholes model of Brownian
motion, the final cash flow/payout computation, and the logic to support communication with the
Power8 host. Our implementation of the pricing function is at least fifty times faster than an equivalent
software implementation running on the Power8 host, and thus offers a fast, low cost accelerator for
portfolio risk analysis.
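As a point of reference for the pipeline described above, here is a minimal software-only sketch in C of Monte Carlo pricing under a Black-Scholes model. It is not the FPGA design: for brevity it substitutes a pseudo-random Box-Muller normal generator for the Sobol sequence and accurate inverse-normal function, and it uses a simple placeholder payoff (a European call) rather than the accumulator forward's cash-flow rules, which are not given in the abstract. All parameter values are illustrative.

    /* Minimal software sketch (not the FPGA design): Monte Carlo pricing of a
       simple payoff under a Black-Scholes geometric Brownian motion model.
       The real design uses a Sobol quasi-random generator and an accurate
       inverse-normal function; Box-Muller pseudo-random normals are used here
       for brevity, and the payoff below is a placeholder. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Draw one standard normal variate via the Box-Muller transform. */
    static double normal_sample(void)
    {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }

    int main(void)
    {
        const double s0 = 100.0, strike = 100.0;  /* spot and strike (illustrative) */
        const double r = 0.02, sigma = 0.25;      /* risk-free rate and volatility */
        const double T = 1.0;                     /* maturity in years */
        const int steps = 252, paths = 100000;
        const double dt = T / steps;
        const double drift = (r - 0.5 * sigma * sigma) * dt;
        const double diffusion = sigma * sqrt(dt);

        srand(12345);                             /* fixed seed for repeatability */
        double payoff_sum = 0.0;
        for (int p = 0; p < paths; ++p) {
            double s = s0;
            /* Generate one geometric Brownian motion price path. */
            for (int t = 0; t < steps; ++t)
                s *= exp(drift + diffusion * normal_sample());
            /* Placeholder payoff: a European call on the terminal price. */
            payoff_sum += (s > strike) ? (s - strike) : 0.0;
        }
        printf("estimated price: %f\n", exp(-r * T) * payoff_sum / paths);
        return 0;
    }

An FPGA implementation can unroll and pipeline the inner path loop and evaluate many independent paths in parallel, which is the kind of parallelism that makes this workload a good fit for a CAPI-attached accelerator.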
Daniel Beece received a Bachelor of Science in Engineering Physics from Cornell University, Ithaca, New
York, and a Master's degree and Ph.D. in Physics from the University of Illinois at Urbana-Champaign. He has
been a Research Staff Member at the T.J. Watson Research Center at IBM since 1982. He has worked in
the areas of circuit timing and optimization, logic and function VLSI system simulation, hardware
accelerators for VLSI, highly parallel systems and special purpose architectures. He currently works in
the area of VLSI systems, design automation and application acceleration.