Leveling the Field for Multicore Open Systems Architectures Markus Levy President, EEMBC President, Multicore Association Analyzing the Multicore Ecosystem • Embedded Microprocessor Benchmark Consortium® (EEMBC) • Industry benchmarks since 1997 • Tool for evaluating embedded processors, compilers, systems • MultiBench for shredding multicore processors Enabling the Multicore Ecosystem • Initial engagement began in May 2005 • Industry-wide participation • Current efforts • • • • Communications APIs Hypervisors Multicore Programming Practices Resource Management Multicore Issues to Solve • Concurrent programming • Communications, synchronization, resource management between/among cores • Performance analysis • Debugging • Distributed power management • OS virtualization • Modeling and simulation • Load balancing • Algorithm partitioning Multicore Benchmarking Rules • Do not rely on a single answer • Match your application requirements – Small or large data sets – Few or many threads – Dependencies – OS overhead Benchmarking Multicore – What’s Important? • • • • • • Measuring scalability Memory and I/O bandwidth Inter-core communications OS scheduling support Efficiency of synchronization System-level functionality EEMBC Multicore Strategy • Evaluation and future development of scalable SMP architectures – Includes multicore and manycore • Measure impact of parallelization and scalability across both data processing and computationally-intensive tasks Some Results Huge drop in performance when oversubscribed Nice scaling on networking only workloads Quad Core 4.5 4 3.5 Speedup 3 2.5 2 1.5 1 Some benchmarks plateau earlier then expected 0.5 0 1 2 4 8 Number of concurrent streams 64M-check-reassembly 64M-cmykw2 64M-rotatew2 MultiBench Results 64M-tcp-mixed Two Processor System Utilizing Single Memory Controller Quad Core Processor 1 Intel Front Side Bus North Bridge DDR2 Interface DDR2 FB DIMM Memory Quad Core Processor 2 Processors 1 and 2 must always arbitrate for memory via their front side bus connection through the North Bridge. 0 1. Find max values for each workload 2. Ratio of all scores rotate-color-4M-90deg rotate-34kX512-90deg rotate-16x4Ms4w8 rotate-16x4Ms4w4 rotate-16x4Ms4w2 rotate-16x4Ms4w1 rotate-16x4Ms32w8 rotate-16x4Ms32w4 rotate-16x4Ms32w2 rotate-16x4Ms32w1 rotate-16x4Ms1w8 rotate-16x4Ms1w4 rotate-16x4Ms1w2 2 rotate-16x4Ms1w1 rgbcmyk-5x12M8workers rgbcmyk-5x12M4workers rgbcmyk-5x12M2workers rgbcmyk-5x12M1workers md5-32M4worker md5-32M2worker md5-32M1worker ipres-72M2worker ipres-72M1worker ippktcheck-64M-1Worker 64M-x264-8workers 64M-x264-4workers 64M-x264-2workers 64M-x264-1worker 64M-tcp-mixed 64M-rotatew2 64M-cmykw2-rotatew2 64M-cmykw2 64M-check-reassembly-tcp-h264w2 64M-check-reassembly-tcp-cmykw2-rotatew2 64M-check-reassembly-tcp 64M-check-reassembly Dual Quads vs. Single Quad 2.5 No Workloads Yield 2x Scaling 1.5 1 0.5 DDR2 Interface Quad Core Processor 1 Link Quad Core Processor 2 Direct Access Shared Access Doubly Shared Access DDR2 Interface DDR2 ECC Registered Memory DDR2 ECC Registered Memory Two Processor System Utilizing Dual Memory Controllers Quad Core Processor 1 Link Quad Core Processor 2 DDR2 Interface • Processor 1 must always access memory by traversing link to Processor 2 • Requires arbitration to access Processor 2’s memory • Processor 2 always has prioritized access to this memory since it is directly attached. • Affinity can help performance DDR2 ECC Registered Memory Two Processor System Utilizing Single Memory Controller 4.5 4 3.5 3 64M-cmykw2 2.5 64M-cmykw2-rotatew2 2 64M-rotatew2 1.5 1 0.5 Dual Quad Cores – Dual Memory Controllers 0 1 2 3 4 6 8 12 16 20 Dual Quad Cores – Single Memory Controller 4.5 4 3.5 3 64M-cmykw2 2.5 64M-cmykw2-rotatew2 2 64M-rotatew2 1.5 1 0.5 0 1 2 3 4 6 8 12 16 20 •Rotation by multiple workers cooperating to process a single image. •Each worker thread acquires slices and writes them to the output buffer. •Potential bottlenecks related to memory interfaces and synchronization between worker threads. 4 3.5 3 2.5 rotate-color-4M-90deg (DS) 2 rotate-color-4M-90deg (DD) 1.5 1 0.5 0 1 2 3 4 5 6 7 8 9 8 7 6 5 rotate-16x4Ms32w1 rotate-16x4Ms32w2 4 rotate-16x4Ms32w4 rotate-16x4Ms32w8 3 2 Dual Quad Cores – Dual Memory Controllers 1 0 1 2 3 4 5 6 7 8 Dual Quad Cores – Single Memory Controller 9 8 7 6 5 rotate-16x4Ms32w1 rotate-16x4Ms32w2 4 rotate-16x4Ms32w4 rotate-16x4Ms32w8 3 2 1 0 1 2 3 4 5 6 7 8 9 The Multicore Association Roadmap Services Value Added Functions •Load Balancing •System Mgt. •Power Mgt. •Reliability •Quality of Service • Languages • Programming Models • Design Environments • Application Generators • Benchmarks The Four MCA Pillars Communications - MCAPI: ultra-light weight Communication Resource Management Task Management MCA Foundation APIs Debug Virtualization (or OS) Adopted stds Multicore System Resource Management - Memory management - Basic synchronization - Resource registration - Resource partitioning Task Management -Task scheduling Why Virtualize? • • • • • Hardware consolidation Migration and hosting of legacy applications Resource management and balancing Faster provisioning Fault tolerance Virtualized Multicore System CORE #1 CORE #2 CORE #3 CORE #4 Fractional CPU assignment for background tasks such as firmware update/system health monitor/power management Benchmarking involves many system-level elements EEMBC Hypermark Important Metrics • Overall performance overhead of a hypervisor, i.e. CPU loading • Static footprint (code and data size) • Jitter • Interrupt latency • Comparison to native performance Multicore Programming Practices (MPP) • Long term – Continue research into languages, methodologies, etc • Short term – How today’s embedded C/C++ code may be written to be “multicore ready” • Influence of a group of like-minded methodology experts to ensure completeness, usefulness and industry-wide compatibility • Creation of a standard “best practices” guide through a recognized, neutral industry body – Based on capturing current best practices 20 MPP Scope & Approach • Focus on existing C/C++ without extensions, targeting current architectures • Draw up framework of common pitfalls when transitioning from serial to parallel • Consider solutions or avoidance tactics • Analyze performance of solution Sequential C/C++ Architecture 1 Architecture 2 e.g. Shared e.g. Programmer Memory Managed Memory 21 Architecture 3 e.g. Message Passing Summary • Performance analysis highlights benefits and pitfalls • EEMBC licensing and membership • Portability prevents entrapment • Multicore Association standards and working groups