Leveling the Field for Multicore Open Systems Architectures Markus Levy President, EEMBC

advertisement
Leveling the Field for Multicore
Open Systems Architectures
Markus Levy
President, EEMBC
President, Multicore Association
Analyzing the Multicore Ecosystem
• Embedded Microprocessor Benchmark
Consortium® (EEMBC)
• Industry benchmarks since 1997
• Tool for evaluating embedded processors,
compilers, systems
• MultiBench for shredding multicore
processors
Enabling the Multicore Ecosystem
• Initial engagement began in May 2005
• Industry-wide participation
• Current efforts
•
•
•
•
Communications APIs
Hypervisors
Multicore Programming Practices
Resource Management
Multicore Issues to Solve
• Concurrent programming
• Communications, synchronization, resource
management between/among cores
• Performance analysis
• Debugging
• Distributed power management
• OS virtualization
• Modeling and simulation
• Load balancing
• Algorithm partitioning
Multicore Benchmarking Rules
• Do not rely on a single answer
• Match your application requirements
– Small or large data sets
– Few or many threads
– Dependencies
– OS overhead
Benchmarking Multicore –
What’s Important?
•
•
•
•
•
•
Measuring scalability
Memory and I/O bandwidth
Inter-core communications
OS scheduling support
Efficiency of synchronization
System-level functionality
EEMBC Multicore Strategy
• Evaluation and future development of
scalable SMP architectures
– Includes multicore and manycore
• Measure impact of parallelization and
scalability across both data processing
and computationally-intensive tasks
Some Results
Huge drop in performance
when oversubscribed
Nice scaling on networking
only workloads
Quad Core
4.5
4
3.5
Speedup
3
2.5
2
1.5
1
Some benchmarks plateau earlier
then expected
0.5
0
1
2
4
8
Number of concurrent streams
64M-check-reassembly
64M-cmykw2
64M-rotatew2
MultiBench Results
64M-tcp-mixed
Two Processor System Utilizing Single
Memory Controller
Quad
Core
Processor 1
Intel Front Side Bus
North
Bridge
DDR2
Interface
DDR2 FB DIMM
Memory
Quad
Core
Processor 2
Processors 1 and 2 must always arbitrate for memory via
their front side bus connection through the North Bridge.
0
1. Find max values for each workload
2. Ratio of all scores
rotate-color-4M-90deg
rotate-34kX512-90deg
rotate-16x4Ms4w8
rotate-16x4Ms4w4
rotate-16x4Ms4w2
rotate-16x4Ms4w1
rotate-16x4Ms32w8
rotate-16x4Ms32w4
rotate-16x4Ms32w2
rotate-16x4Ms32w1
rotate-16x4Ms1w8
rotate-16x4Ms1w4
rotate-16x4Ms1w2
2
rotate-16x4Ms1w1
rgbcmyk-5x12M8workers
rgbcmyk-5x12M4workers
rgbcmyk-5x12M2workers
rgbcmyk-5x12M1workers
md5-32M4worker
md5-32M2worker
md5-32M1worker
ipres-72M2worker
ipres-72M1worker
ippktcheck-64M-1Worker
64M-x264-8workers
64M-x264-4workers
64M-x264-2workers
64M-x264-1worker
64M-tcp-mixed
64M-rotatew2
64M-cmykw2-rotatew2
64M-cmykw2
64M-check-reassembly-tcp-h264w2
64M-check-reassembly-tcp-cmykw2-rotatew2
64M-check-reassembly-tcp
64M-check-reassembly
Dual Quads vs. Single Quad
2.5
No Workloads
Yield 2x Scaling
1.5
1
0.5
DDR2
Interface
Quad
Core
Processor 1
Link
Quad
Core
Processor 2
Direct Access
Shared Access
Doubly Shared Access
DDR2
Interface
DDR2 ECC Registered
Memory
DDR2 ECC Registered
Memory
Two Processor System Utilizing
Dual Memory Controllers
Quad
Core
Processor 1
Link
Quad
Core
Processor 2
DDR2
Interface
• Processor 1 must always access memory by traversing link
to Processor 2
• Requires arbitration to access Processor 2’s memory
• Processor 2 always has prioritized access to this memory
since it is directly attached.
• Affinity can help performance
DDR2 ECC Registered
Memory
Two Processor System Utilizing
Single Memory Controller
4.5
4
3.5
3
64M-cmykw2
2.5
64M-cmykw2-rotatew2
2
64M-rotatew2
1.5
1
0.5
Dual Quad Cores –
Dual Memory Controllers
0
1
2
3
4
6
8
12
16
20
Dual Quad Cores –
Single Memory Controller
4.5
4
3.5
3
64M-cmykw2
2.5
64M-cmykw2-rotatew2
2
64M-rotatew2
1.5
1
0.5
0
1
2
3
4
6
8
12
16
20
•Rotation by multiple workers cooperating to process a single image.
•Each worker thread acquires slices and writes them to the output
buffer.
•Potential bottlenecks related to memory interfaces and synchronization
between worker threads.
4
3.5
3
2.5
rotate-color-4M-90deg (DS)
2
rotate-color-4M-90deg (DD)
1.5
1
0.5
0
1
2
3
4
5
6
7
8
9
8
7
6
5
rotate-16x4Ms32w1
rotate-16x4Ms32w2
4
rotate-16x4Ms32w4
rotate-16x4Ms32w8
3
2
Dual Quad Cores –
Dual Memory Controllers
1
0
1
2
3
4
5
6
7
8
Dual Quad Cores –
Single Memory Controller
9
8
7
6
5
rotate-16x4Ms32w1
rotate-16x4Ms32w2
4
rotate-16x4Ms32w4
rotate-16x4Ms32w8
3
2
1
0
1
2
3
4
5
6
7
8
9
The Multicore Association Roadmap
Services
Value Added Functions
•Load Balancing
•System Mgt.
•Power Mgt.
•Reliability
•Quality of Service
• Languages
• Programming Models
• Design Environments
• Application Generators
• Benchmarks
The Four MCA Pillars
Communications
- MCAPI: ultra-light weight
Communication
Resource
Management
Task
Management
MCA Foundation APIs
Debug
Virtualization (or OS)
Adopted stds
Multicore System
Resource Management
- Memory management
- Basic synchronization
- Resource registration
- Resource partitioning
Task Management
-Task scheduling
Why Virtualize?
•
•
•
•
•
Hardware consolidation
Migration and hosting of legacy applications
Resource management and balancing
Faster provisioning
Fault tolerance
Virtualized Multicore System
CORE #1
CORE #2
CORE #3
CORE #4
 Fractional CPU assignment for background tasks such as firmware
update/system health monitor/power management
 Benchmarking involves many system-level elements
EEMBC Hypermark
Important Metrics
• Overall performance overhead of a
hypervisor, i.e. CPU loading
• Static footprint (code and data size)
• Jitter
• Interrupt latency
• Comparison to native performance
Multicore Programming Practices
(MPP)
• Long term
– Continue research into languages, methodologies, etc
• Short term
– How today’s embedded C/C++ code may be written to be
“multicore ready”
• Influence of a group of like-minded methodology experts
to ensure completeness, usefulness and industry-wide
compatibility
• Creation of a standard “best practices” guide through a
recognized, neutral industry body
– Based on capturing current best practices
20
MPP Scope & Approach
• Focus on existing C/C++
without extensions, targeting
current architectures
• Draw up framework of
common pitfalls when
transitioning from serial to
parallel
• Consider solutions or
avoidance tactics
• Analyze performance of
solution
Sequential
C/C++
Architecture 1
Architecture 2
e.g. Shared
e.g. Programmer
Memory
Managed Memory
21
Architecture 3
e.g. Message
Passing
Summary
• Performance analysis highlights benefits
and pitfalls
• EEMBC licensing and membership
• Portability prevents entrapment
• Multicore Association standards and
working groups
Download