DATA STRUCTURES OPTIMISATION FOR
MANY-CORE SYSTEMS
Matthew Freeman | Supervisor: Maciej Golebiewski
CSIRO Vacation Scholar Program 2013-14
The Multi-core Age
• Mobile phone: 2-4 cores
• PC: 4-16 cores
• Intel Xeon Phi: 61 cores
• CSIRO ‘Bragg’ compute cluster: 2048 cores
2 | Presentation title | Presenter name
Programming for multi-cores
Diagram: Problem → Divide the problem → CPU Core 1, CPU Core 2, CPU Core 3, CPU Core 4 (machine instructions, execution)
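A minimal C++ sketch of dividing a problem among cores, assuming OpenMP is available. The function name parallel_sum and the summing task are illustrative choices, not taken from the project. If the compiler is run without OpenMP support, the pragma is ignored and the loop runs on a single core.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example task: sum a large array.
// OpenMP splits the loop iterations among the available cores. Each core
// accumulates a private partial sum, and the partial sums are combined
// when the loop finishes.
long long parallel_sum(const std::vector<int>& values) {
    long long total = 0;
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        total += values[static_cast<std::size_t>(i)];
    }
    return total;
}
```

The work divides cleanly here because no iteration depends on another. The remaining slides deal with the harder case where the cores must share one structure.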
Amdahl's Law
• The maximum speedup depends on the percentage of the
problem that can run in parallel.
Maximum speedup by parallel portion:
• 95% parallel: 20x speedup
• 90% parallel: 10x speedup
• 75% parallel: 4x speedup
• 50% parallel: 2x speedup
• Single-core processor: 1x speed
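The speedup figures follow from the usual statement of Amdahl's Law: with a parallel fraction p and n cores, speedup = 1 / ((1 - p) + p / n), which approaches 1 / (1 - p) as n grows. A small C++ sketch follows; the function names are illustrative.

```cpp
#include <cassert>

// Amdahl's Law for a finite number of cores:
// speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

// Upper bound as the core count grows without limit: 1 / (1 - p).
// Requires p < 1, otherwise the bound is infinite.
double amdahl_max_speedup(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

For example, amdahl_max_speedup(0.95) gives about 20, while amdahl_speedup(0.95, 61) gives about 15.25, so a 61-core processor stays below the 20x ceiling.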
Data structures:
• Memory (data) is still a shared resource.
Single Core Computer: one CPU core with its own memory (data)
4-Core Computer: four CPU cores sharing one memory (data)
Linked-list (Stack) Data Structure
• A “node” holds data.
• Each node has a link to the next data point.
TOP → Data → EMPTY
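One way to express the node and the TOP pointer in C++ is sketched below. The int payload and the names Node and Stack are assumptions for illustration; the slides do not show the project's actual types.

```cpp
#include <cassert>

// One "node": a piece of data plus a link to the next node.
struct Node {
    int data;     // payload; int is an assumption, any type would do
    Node* next;   // link to the next data point; nullptr plays the role of EMPTY
};

// The stack itself only stores TOP. Every other node is reached
// by following the links starting from TOP.
struct Stack {
    Node* top = nullptr;   // nullptr means the stack holds no data

    bool empty() const { return top == nullptr; }
};
```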
Add new item (Push)
We want to add a chunk of data (Data B) to the structure.
New: Data B
TOP → Data A → EMPTY
Add new item (Push)
Steps: For new data B
1) Find the start of the structure (TOP)
New: Data B
TOP → Data A → EMPTY
Add new item
Steps: For new data B
2) Link into the structure.
Data B → Data A → EMPTY (TOP still points to Data A)
Add new item
Steps: For new data B
3) Update TOP.
TOP (new) → Data B → Data A → NULL
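The three push steps can be sketched in C++ as follows, for the single-threaded case only. The names Node, Stack, push and clear are illustrative, and the int payload is an assumption.

```cpp
#include <cassert>

struct Node {
    int data;
    Node* next;
};

struct Stack {
    Node* top = nullptr;   // nullptr marks the end of the structure
};

// Single-threaded push following the three steps on the slides.
void push(Stack& stack, int value) {
    Node* old_top = stack.top;               // 1) find the start of the structure (TOP)
    Node* node = new Node{value, old_top};   // 2) link the new node into the structure
    stack.top = node;                        // 3) update TOP
}

// Frees every node; added only so the sketch does not leak memory.
void clear(Stack& stack) {
    while (stack.top != nullptr) {
        Node* next = stack.top->next;
        delete stack.top;
        stack.top = next;
    }
}
```

Each step is a separate statement, which is what makes the operation unsafe once more than one thread calls push on the same stack.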
Resulting structure
• Like stacking dinner plates
• Only need to keep track of where TOP is to access the
rest.
TOP → Data → Data → Data → Data → Data → NULL
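Because every node links to the next, TOP alone is enough to reach all of the data. A short C++ sketch of walking the structure follows; the function name contents and the int payload are illustrative.

```cpp
#include <cassert>
#include <vector>

struct Node {
    int data;
    Node* next;
};

// Walks from TOP to the end of the structure, collecting the data
// in stack order (most recently pushed first).
std::vector<int> contents(const Node* top) {
    std::vector<int> out;
    for (const Node* n = top; n != nullptr; n = n->next) {
        out.push_back(n->data);
    }
    return out;
}
```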
What happens in multi-core systems?
Two threads try to operate on the stack structure:
Thread 1 starts its operation at time T.
Thread 2 starts its operation at time T + 1 nanosecond.
Because each step takes time to complete, the two operations overlap and errors occur.
What happens in multi-core systems?
This causes the steps of the two threads to interleave:
Thread 1 reads TOP (1)
Thread 2 reads TOP (1)
Thread 1 sets the next pointer (2)
Thread 2 sets the next pointer (2)
Thread 1 updates TOP (3)
Thread 2 updates TOP (3)
Data B is lost forever because it is no longer reachable
from TOP (stack failure).
Thread 1: Data B → Data A → EMPTY
Thread 2: TOP → Data C → Data A → EMPTY
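The failure can be reproduced deterministically by replaying the interleaved steps by hand in a single thread. The C++ sketch below does that; it is a simulation of the interleaving, not real multithreaded code, and the names are illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Node {
    std::string data;
    Node* next;
};

// Replays the six interleaved steps in a fixed order and returns
// the data that is still reachable from TOP afterwards.
std::vector<std::string> replay_interleaved_pushes() {
    Node a{"Data A", nullptr};
    Node b{"Data B", nullptr};   // pushed by "thread 1"
    Node c{"Data C", nullptr};   // pushed by "thread 2"
    Node* top = &a;

    Node* seen_by_t1 = top;      // thread 1 reads TOP (1)
    Node* seen_by_t2 = top;      // thread 2 reads TOP (1)
    b.next = seen_by_t1;         // thread 1 sets the next pointer (2)
    c.next = seen_by_t2;         // thread 2 sets the next pointer (2)
    top = &b;                    // thread 1 updates TOP (3)
    top = &c;                    // thread 2 updates TOP (3), overwriting thread 1's update

    std::vector<std::string> reachable;
    for (const Node* n = top; n != nullptr; n = n->next) {
        reachable.push_back(n->data);
    }
    return reachable;
}
```

The function returns Data C followed by Data A. Data B was linked to Data A, but nothing links to Data B, so it cannot be reached from TOP.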
How do we fix this?
• Use “data locks”.
• Protect the 3 steps.
• One thread at a time is granted access to the stack.
• Complete an operation and release the lock.
This is the standard approach for multithreaded
structures.
Locks
Easy to use. Two lines of code are added to fix the problem:
- Get lock.
- Steps 1, 2, 3.
- Release lock.
× Slow. Only one thread at a time can hold the lock.
The locked section becomes sequential code.
This is the code that cannot run in parallel.
Analogy: Merging highway traffic into a single lane.
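A minimal C++ sketch of the lock-based approach, assuming std::mutex as the lock. The slides do not say which lock the project used, and the names LockedStack and push_range are illustrative. The OpenMP pragma in push_range is ignored if the compiler is run without OpenMP support, in which case the loop runs on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

struct Node {
    int data;
    Node* next;
};

class LockedStack {
public:
    ~LockedStack() {
        while (top_ != nullptr) {
            Node* next = top_->next;
            delete top_;
            top_ = next;
        }
    }

    void push(int value) {
        Node* node = new Node{value, nullptr};
        std::lock_guard<std::mutex> guard(lock_);   // get lock
        node->next = top_;                          // steps 1 and 2: read TOP, link in
        top_ = node;                                // step 3: update TOP
    }                                               // release lock when guard is destroyed

    std::size_t size() {
        std::lock_guard<std::mutex> guard(lock_);
        std::size_t count = 0;
        for (const Node* n = top_; n != nullptr; n = n->next) ++count;
        return count;
    }

private:
    std::mutex lock_;
    Node* top_ = nullptr;
};

// Pushes 0..count-1. With OpenMP enabled the iterations are spread
// over several threads, all contending for the same lock.
void push_range(LockedStack& stack, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        stack.push(i);
    }
}
```

No node is lost, but every push waits its turn at the lock, which is the single-lane effect described in the analogy.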
Lock-free
New method
• Lock-free data structure.
• A special low-level instruction re-checks TOP and updates it
in a single atomic step, so the three steps behave like one
indivisible operation.
• Removes the need for locks.
• Called a Compare-Exchange.
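A minimal C++ sketch of a Compare-Exchange push, assuming std::atomic from the C++ standard library. This is a common textbook form, not the project's implementation, and the names LockFreeStack and push_range are illustrative. Only push is shown; removing items without locks needs extra care and is left out. The OpenMP pragma is ignored if OpenMP is not enabled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct Node {
    int data;
    Node* next;
};

class LockFreeStack {
public:
    // Cleanup assumes no other thread is still using the stack.
    ~LockFreeStack() {
        Node* n = top_.load();
        while (n != nullptr) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }

    void push(int value) {
        Node* node = new Node{value, nullptr};
        // Steps 1 and 2: read TOP and link the new node to it.
        node->next = top_.load(std::memory_order_relaxed);
        // Step 3: set TOP to the new node only if TOP still equals node->next.
        // On failure, compare_exchange_weak writes the current TOP into
        // node->next, so the loop retries with a fresh link.
        while (!top_.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    // Counts the nodes; intended for use when no pushes are in flight.
    std::size_t size() const {
        std::size_t count = 0;
        for (const Node* n = top_.load(std::memory_order_acquire); n != nullptr; n = n->next) {
            ++count;
        }
        return count;
    }

private:
    std::atomic<Node*> top_{nullptr};
};

// Pushes 0..count-1, from several threads when OpenMP is enabled.
void push_range(LockFreeStack& stack, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        stack.push(i);
    }
}
```

If another thread changes TOP between the read and the update, the Compare-Exchange fails and the push retries instead of overwriting the other thread's work, so no thread ever waits on a lock.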
Lock-free
• Downside: Writing lock-free code is difficult (hence the
project).
• The Compare-Exchange operation forms the base for
writing lock-free code.
• The project implements specifications taken from research
papers.
Lock-free
Implemented a range of lock-free optimizations for the stack.
Open coding standards (C++, OpenMP).
Benchmarked on an Intel Xeon Phi 61-core processor.
The lock-free structure performed about 2x better than the
lock-based one for pure stack operations.
Summary
Amdahl’s Law shows that it’s important to optimize sequential
sections of code.
The shared data structures are often sequential bottlenecks.
Implementing lock-free data structures reduced this bottleneck.