Godson-T A High-Efficient Many-Core Architecture for Parallel

advertisement
Institute of Computing Technology
Chinese Academy of Sciences
Many-Core Processor Design for
High Performance and High Throughput
Computing
Dongrui Fan, Xiaochun Ye
Institute of Computing Technology, Chinese Academy of Sciences
AMS
Processor Research Plan in AMS
High Throughput Computing
ExtraScale Computing
1000 cores on chip Advanced Simulation
1000 threads on chip
Manycore Research
Godson-T(Tera Flops)
• Fine-Grain
Sychronization
• Resource Sharing
Godson-D (Throughput)
HPC
HTC
2
• Realtime Processing
• Concurrent Independent
Processing
CO-DESIGN 2015
Overview

Microprocessor Architecture Challenges

Godson-T Features for High Performance
Computing

Godson-D Features for High Throughput
Computing
3
CO-DESIGN 2015
Microprocessor Architecture Revolution

Memory wall + Power wall + ILP wall = Brick wall !
What parallel architecture will be successful?
Power Wall
Many-Core
Performance
Multi-Core
ILP Wall
Superscalar
Processor Core
Simpler
Processor Core
Power, Complexity

Parallel architectures come to rescue, superscalar processor is being
replaced by on-chip multi-core / many-core processor
4
CO-DESIGN 2015
Exploiting Polymorphic Parallelism
Instruction-Level
Parallelism
Godson-3
With Powerful
Processing Unit
Heterogenous
Many-Core
With moderate
DLP and strong
TLP exploitation
Godson-T
Thread-Level Parallelism
5
CO-DESIGN 2015
Many-Core Challenges: The 5 P’s

Power efficiency challenge


Performance challenge


How to provide a converged many core solution in a standard programming
environment
Parallelism challenge


How to scale from 1 to 1000 cores – the number of cores is the new
Megahertz
Programming challenge


Performance per watt is the new metric – Dark Silicon becomes serious
when facing 10nm process technology
How to exploit parallelism through software and hardware co-design
Platform challenge

How to maximize the usability, such as, runtime system (OS), compiler &
library, many-core debug…
6
CO-DESIGN 2015
What we could help……
Productivity Side:

Most sequential programmers are not ready to switch into parallel
programming


Unique programming model cannot solve all problems efficiently;


conservative programming model and make incremental improvement?
support multiple programming features efficiently?
Locks are messy

new way to eliminate deadlock?
……
Performance Side:

On-chip synchronization is fast;


Flops are cheap, memory communications are expensive;


handle all synchronizations on-chip?
trade flops for communication latency?
Fine-grained parallelism should be taken into consideration;

enable data-driven thread execution on chip?
……
7
CO-DESIGN 2015
Overview



Microprocessor Architecture Challenges
›
Memory wall + Power wall + ILP wall
›
The 5P’s: Many-core Processor Challenges
Godson-T Features for High Performance
Computing
›
Godson-T architecture overview
›
Architectural supports for multithreading
›
Software runtime system
Godson-D Features for High Throughput
Computing
8
CO-DESIGN 2015
Architecture Overview of Godson-T
Program
Runtime
System
Memory Controller
CORE
GodsonT
Processor
L1$
Private
Memroy
R
Memory Controller
Memory Controller
Processing Core
Private I-$
Fetch &
Decode
Register Files
FP/
INT MAC Sync
Vector
ALU ALU Unit
ALU
Synchronization
Manager
L2
Cache Bank
I/O
Controller
SPM
D$
Local
Memory
DTA
Router
Memory Controller
9
CO-DESIGN 2015
Processing Core
4Byte
FETCH
Instruction
Buffer
1 instruction
DECODE
1 instruction
32Byte 2r/1w
Vector
Register File
32 Bytex 2x1
Vector/
Floating-Point
Unit
Vector/
Fixed-Point
Unit

ISA: MIPS (user), SIMD-ext., sync-ext.

8-stage pipeline

Dual-issue per thread

Expected SIMD

L0_Icache
L0_Dcache
16Byte
16Byte
Load/Store/
Synchronization
Unit

Fast level-1 memory

32KB private memory

Automatically mapped into stack
address space

Full/empty bit tagged on each 64-bit
slot enables efficient producerconsumer style synchronization
16Byte in/out msg
8Byte
SPM/
L1 Cache
Router

10
Load, Store, Data Move,
Arithmetic ……
Communication with external modules
through message packets
CO-DESIGN 2015
Interconnection


Separated routers with each processing core

Static XY wormhole routing

Round-Robin arbitration

Duplex 128bit link for each connection
Guaranteed in-order point-to-point communication

Deadlock-free & livelock-free

Scalable and power-efficient for MESH topology

Low latency core-to-core communication

Separated networks allowing traffic segregation and tolerating burst
DMA transfer
11
CO-DESIGN 2015
Memory Hierarchy
Data Transfer
Bandwidth
Reg
Each processing core with 32 fixedpoint and 32 floating-point registers
Local
Memory
32KB local memory, including L1$D,
SPM
512GB/s
128GB/s
16 address-interleaved L2 cache
banks, 128KB each
L2 Cache
51.2GB/s
Off-Chip Memory
4 DDR3-1600 memory controllers
12
CO-DESIGN 2015
Data Communication & Synchronization
Data Transfer Agent (DTA)
DTA horizontal operation
DTA vertical operation
DTA vertical operation

SPM
DTA
L2
Cache
SPM
……
DTA
Router
……
Router
……
off-chip
memory
Router
……
Programmable asynchronous
data transfer agent
 Support vertical and horizontal
DTA operations (such as
prefetch)

Data transfers between multidimension addresses (such as
matrix inversion )

Network load perception,
automatically flow control
(improve bandwidth-efficiency)

Support fine-grain synchronous
operations
on-chip network
(a) vertical and horizontal DTA operations
chunk stride
chunk stride
……
block
block stride
……
block stride
invovled data block address
(b) 2D strided DTA operations
13
CO-DESIGN 2015
Data Communication & Synchronization
Region-Based Cache Coherent Protocol

Lazy cache coherent protocol




Pure mutual-exclusion synchronization instruction without memory
accesses;
Eliminate busy-waiting for locks;
More scalable than bus-snoopy and directory-based cache protocol;
Enable hardware deadlock detecting.
int mutual_func_a()
{
……
acquire_lock(LOCK_VAL);
X = 3;
release_lock(LOCK_VAL);
……
}
X=0
Mini Core
Mini Core
Private Cache
Private Cache
ACQ.
REL.
X=3
ACQ.
REL.
X=0
int mutual_func_b()
{
……
acquire_lock(LOCK_VAL);
Y = X;
Z = X;
release_lock(LOCK_VAL);
……
}
Interconnect
ACK.
CORE 0 UNLOCK
LOCKED
CORE 1 WAITING
UNLOCK
LOCKED
X=3
X=0
Synchronization
Manager
Shared L2 Cache
14
CO-DESIGN 2015
Pthread-like Thread Execution using Private
Memory



Support master-slave style thread execution;
Programming easily: C programming model + Pthread API;
Compatible with large amount of conventional codes on shared
memory system.
int slave_func(int argc, int argv[])
{
// do computation;
……
godsont_exit(0);
}
int mater_func()
{
……
int thread_id =
godsont_ create_thread(slave_func, 2, param_array);
……
}
Stack
Pointer
1. Master transmits
parameters to slave;
2. Master setups PC and
register context for slave;
3. Slave function begins
execution, private memory
acts as a local stack.
SPM
PC
ab
Interconnect
Tens of cycles to create or terminate a thread.
15
CO-DESIGN 2015
GodRunner
Static
Software Stack
The programmers majorly focus
on expressing parallelism, While the
processor and runtime system
concentrate on efficient parallel
executions.
The library provides a rich set of
Software APIs for task management,
including batch thread management.
Programmer is responsible for
creating,terminating and
synchronizing tasks by inserting
appropriate Pthreads-like APIs.
Applications
(ANSI C Programs)
Pthreads-like
C Libraries
Library
GNU C Compiler and
Binutils
Application Binary
Dynamic
“GodRunner” Runtime System
Godson-T Architecture
Simulator (GAS)
Hardware
16
CO-DESIGN 2015
Evaluating the Scalability of
Performance
64
pFind (32K pep.)
iBlastP (600 seq.)
FFT (m=16)
RADIX (n=256k)
LU (n=512)
CHOLESKY (tk29.O)
56
48
Speedup
40
32
24
16
8
0
0
8
16
24
32
40
48
56
64
# of Threads
Speedup of the bioinformatics and SPLASH-2 benchmarks on Godson-T
17
CO-DESIGN 2015
Summary of Godson-T



Exploiting parallelism on multiple level

Without disturbing DLP, Can merge with DLP processors

Exploiting fine-grained data-driven TLP
A scalable many-core architecture research platform

Lock-based cache coherence protocol

Asynchronous DTA and hardware-supported synchronization mechanisms

Pthreads-like programming model, and versatile parallel libraries

Flexible to extend, easy to use, and simple to implement
Keep balance between software and hardware to gain
programmability and performance
18
CO-DESIGN 2015
Overview



Microprocessor Architecture Challenges
›
Memory wall + Power wall + ILP wall
›
The 5P’s: Many-core Processor Challenges
Godson-T Features for High Performance
Computing
›
Godson-T architecture overview
›
Architectural supports for multithreading
›
Software runtime system
Godson-D Features for High Throughput
Computing
›
Motivation of DPU(Godson-D)
›
Godson-D Features for High Throughput Computing
19
CO-DESIGN 2015
HPC versus HTC
Conventional
Workload
FLOPS
Characteristics
Parallelism
Now
Realtime
Reliability
Throughput
Requirement
Performance
Requirement
Conventionally,
Internet Service HTC systems were implemented
High
NextHigh-Volume
Abundant
Throughput
by HPC
As requirements
of for
Data infrastructures.
Thread-level
Yes
Diversity
generation
multiple
Parallelism
Diversity
Tasks
tasks
data-center
throughput,
QoS,
scalability
and
reliability
are
Poor Locality
increasing in emerging HTC systems,
High- conventional wisdom is no longer suitable for
Single-node
Highperformance Scientific Service Polymorphic
fault causes Performance
next-generation
data-center
computing.
Parallelism
No
computing
Mono-Task
whole system for single
system
Good Locality
Exploitation
crash down
task
A Case Study



Story:
A system with hundreds of general purpose high performance
processors
Serves about 1,000,000 users concurrently
Zoom in:
-- one state-of-art processor can process about 7000 users’
requests in specific period
-- each user’s data size is about 10KB
What will happen?
21
CO-DESIGN 2015
……
L2 Cache
Miss Rate: 94.86%
L3 Cache
Miss Rate: 72.37%
Memory
Dynamic user data
Static user data
…
7000 users’ data compete for memory space
Phenomenon
L2 and L3 caches contribute little because of high miss rate
Float point Unit is useless for this mobile internet scenario
Main Cause
 Massive independent user request compete for limited on-chip memory, and
memory access latency dominates the throughput
22
CO-DESIGN 2015
Other HTC Benchmarks

Cache Miss Rate on Commercial Processor
23
CO-DESIGN 2015
Problems in Commercial Processors
We can easily find the following problems in current data center:

Problem 1: Mismatch between Data Format and Data Structure in CPU


Problem 2: Mismatch between Data Flow and Data Path in CPU


-- On-chip Memory
-- Data Path
Problem 3: Mismatch between Special Data Processing requirement and the
General Architecture of CPU


-- Computing
Problem 4: Mismatch between Real Time Data Processing Requirement and
the Unpredictable Architecture of CPU

-- Comprehensive
24
CO-DESIGN 2015
Godson-D Features for HTC
Thousands of threads running
concurrently to hide latency
High throughput data access path
Data Processor Unit
Godson-D
Realtime control technologies
25
CO-DESIGN 2015
Godson-D Architecture
Hierarchical Ring + Cone

+
Thread
Dispatcher
ALU
ALU
ALU
ALU
Shared
SPM
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
D$
Shared
D$
Dispatcher
Dispatcher
Dispatcher
AGU
AGU
AGU
AGU
scheduler
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
Dispatcher
CO-DESIGN 2015
26
Thread Core Group
Shared I$
Shared FU
Memory Access Collect Table(M-Table)
1. M-Table can collect the discrete memory access
requirements together and process them in batch, thus
improve the memory efficiency.
RAM
N cycle
Oldest
N cycle
N cycle
N cycle
N cycle
Performance Speedup: 2.82X
Memory Access Latency: 1.58X
Number of memory requirements: 43%
27
CO-DESIGN 2015
High density NoC(HD-NoC)
• There is a large proportion of small message packets in HTC
applications
• Conventional large links are split into small channels (such as 8 or
16 bits) so that flits in the same direction can be transferred64bit
simultaneously
Typical
Router
Granularity Distribution
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
64bit
64bit
R
8
64bit
4
2
64bit
1
64bit
rd
wt
BFS
rd
wt
GUPS
rd
wt
rd
wt
rd
wt
rd
64bit
R
HD-Router
wt
SSCA2 canneal listrank pagerank
64bit
The effective bandwidth utilization can be improved by 4.5X
28
CO-DESIGN 2015
Hardware Supported Realtime
Scheduler
Fast Pass
Two Jump to reach MCU for
urgent thread
1
2
Hardware realtime scheduler
Normal Thread
Pointer
Null Thread
Pointer
High Priority
Thread Pointer
PID
PID
PID
PID
PID
PID
PID
PID
PID
PID
PID
PID
TID
TID
TID
TID
TID
TID
TID
TID
TID
TID
TID
TID
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Occupied
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
Core_Num
29
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Threshold_Time
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Context_Pointer
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
Next_Thread
CO-DESIGN 2015
Effect of global realtime scheduling
Search
235
209
183
157
131
105
79
53
27
1
thread_id
thread_id
Search
0
2 000 000 4 000 000 6 000 000
finished_cycles
8 000 000
241
217
193
169
145
121
97
73
49
25
1
0
241
217
193
169
145
121
97
73
49
25
1
0
20000000 40000000 60000000
finished_cycles
8000000
Terasort
Thread_id
thread_id
TeraSort
2000000 4000000 6000000
finished_cycles
80000000
235
209
183
157
131
105
79
53
27
1
0
30
20000000 40000000 60000000 80000000
Finished_cycles
CO-DESIGN 2015
HTC System vs. HPC System
HTC
HPC
Memory queue
Processing
Unit
Task queue
Data path
Performanceoriented scheduling
Single direction
M-Table
T-Table
Hardware support
QoS scheduling
Bidirection dynamic
self-adaption
Low width/
High density
High width /
low density
Message-based
memory
Imperative memory
access
Memory
Perf-oriented Cache
replacement
31
QoS oriented Cache
replacement
CO-DESIGN 2015
Thank you!
Download