Low-Power MP-SoC-4 - VADA

SCHEDULING AND TIMING ANALYSIS OF
HW/SW ON-CHIP COMMUNICATION IN
MP SOC DESIGN
Sungkyunkwan University, School of Information and Communication Engineering
© Jun-Dong Cho
Fall 2006
Contents
– Introduction
– Motivation
– Communication Scheduling and HW/SW
Timing Analysis
– Experimental Results
– Conclusion
© Jun-Dong Cho, Summer 2006
Introduction
• On-chip Communication Architecture in MP SoC
design
• On-chip Communication Design
– Design of HW/SW Communication Architecture
– Mapping and scheduling of on-chip communication
• Contribution of this work is the consideration of
both of
– Dynamic behavior of SW communication architecture
– Physical communication buffer sharing
Preliminaries
• Mapping/Allocation
• Communication Scheduling
Motivation
Extended Task Graph
Communication Delay Model
Communication Delay
$$C_i(n) = C_i^{SW}(n) + C_i^{HW}(n) + C_i^{OCN}(n)$$
– Communication Delay of SW Communication Architecture
$$C_i^{SW}(n) = R(v_i) \cdot \left(C_{ISR}(n) + C_{CS}(n)\right) + C_{DD}(n)$$
– Communication Delay of HW Communication Interface
$$C_i^{HW}(n) = n \cdot (\text{delay of HW communication interface})$$
– Communication Delay of On-chip Communication Network
$$C_i^{OCN}(n) = D_{trans} \cdot n + D_{BLOCKED}$$
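The delay model on this slide can be sketched in a few lines of Python. This is only an illustration: the operators were lost in the slide extraction, so the sketch assumes the natural reading (the total is the sum of the three components, R(v_i) scales the ISR and context-switch costs, and the network delay is a per-word transfer delay times n plus a blocking term). All function and parameter names are hypothetical, and the ISR, CS, and DD costs are passed in already evaluated for the given n.

```python
def sw_delay(r, c_isr, c_cs, c_dd):
    """SW communication architecture delay: R(v_i) * (C_ISR + C_CS) + C_DD (assumed form)."""
    return r * (c_isr + c_cs) + c_dd


def hw_delay(n, interface_delay):
    """HW communication interface delay: n words times the per-word interface delay."""
    return n * interface_delay


def ocn_delay(n, d_trans, d_blocked):
    """On-chip network delay: per-word transfer delay times n, plus blocking time."""
    return d_trans * n + d_blocked


def comm_delay(n, r, c_isr, c_cs, c_dd, interface_delay, d_trans, d_blocked):
    """Total communication delay C_i(n) as the sum of the SW, HW, and OCN parts."""
    return (sw_delay(r, c_isr, c_cs, c_dd)
            + hw_delay(n, interface_delay)
            + ocn_delay(n, d_trans, d_blocked))


# Made-up figures, for illustration only.
total = comm_delay(n=4, r=1, c_isr=10, c_cs=20, c_dd=30,
                   interface_delay=2, d_trans=3, d_blocked=5)
```

Keeping the three components as separate functions mirrors the slide, where each term comes from a different layer (software, HW interface, network) and would be measured or estimated separately.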
Communication Scheduling and Timing
Analysis
• ILP for Scheduling Communication Nodes and
Tasks of ETG
– Data dependency constraints
$t_1^{end} \le t_2^{start}$ and $t_1^{end} \le t_3^{start}$
– Resource contention constraints
– Processor and on-chip communication network
ILP for Binary variable
– Physical communication buffer contention constraints
Heuristic Algorithm
• List scheduling
LIST (G(V,E), a) {
  l = 1;
  repeat {
    for each resource type k = 1, 2, …, n_res {
      Determine candidate tasks U_{l,k};
      Determine unfinished tasks T_{l,k};
      Select S_k ⊆ U_{l,k} nodes, such that |S_k| + |T_{l,k}| ≤ a_k;
      Schedule the S_k tasks at time l by setting t_i = l for all i : v_i ∈ S_k;
    }
    l = l + 1;
  } until (v_n is scheduled);
  return (t);
}
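The list-scheduling loop above can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tasks are given as a hypothetical dictionary of (resource type, duration) pairs, the task graph is assumed to be acyclic, every resource type is assumed to have at least one instance, and the priority used to pick among candidates (longest path to a sink) is one common choice that the slide does not specify.

```python
def list_schedule(tasks, preds, capacity):
    """Resource-constrained list scheduling.

    tasks:    {name: (resource_type, duration)}
    preds:    {name: set of predecessor names}
    capacity: {resource_type: number of instances a_k}
    Returns {name: start_step}, with steps counted from 1.
    """
    # Build successor lists so a priority can be computed per task.
    succs = {v: [] for v in tasks}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    memo = {}

    def path_len(v):
        # Longest path from v to any sink, used as the (assumed) priority.
        if v not in memo:
            memo[v] = tasks[v][1] + max((path_len(s) for s in succs[v]), default=0)
        return memo[v]

    start = {}
    step = 1
    while len(start) < len(tasks):
        for k, a_k in capacity.items():
            # T_{l,k}: type-k tasks that started earlier and are still running.
            running = [v for v, s in start.items()
                       if tasks[v][0] == k and s + tasks[v][1] > step]
            # U_{l,k}: unscheduled type-k tasks whose predecessors have all finished.
            ready = [v for v in tasks
                     if v not in start and tasks[v][0] == k
                     and all(p in start and start[p] + tasks[p][1] <= step
                             for p in preds.get(v, ()))]
            ready.sort(key=lambda v: (-path_len(v), v))
            # S_k: as many candidates as free instances allow.
            for v in ready[:max(0, a_k - len(running))]:
                start[v] = step
        step += 1
    return start


# Hypothetical example: two processor tasks feed a bus transfer and a final task.
example_tasks = {"A": ("cpu", 2), "B": ("cpu", 1), "C": ("bus", 1), "D": ("cpu", 1)}
example_preds = {"C": {"A"}, "D": {"C"}}
schedule = list_schedule(example_tasks, example_preds, {"cpu": 1, "bus": 1})
```

With one processor and one bus, task A is scheduled first because it heads the longest chain; B fills the processor only after A finishes.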
Experiments
Experimental Results
• Execution delay of tasks in the H.263 system

  Functional Block     Execution Cycles
  Source               1386
  Motion predictor     1123650
  Macro block          875358
  Encoder              875358
  VLC                  2103156

• Delay of software communication architecture
– Software communication architecture delay measured by ISS.

  Service              Execution Cycles
  ISR                  857
  CS                   1060
  DD (read)            5376
  DD (write)           6528
Execution Time of H.263 encoder
Execution Time of JPEG and IS-95 example
Conclusion
• On-chip Communication Design
– Design of HW/SW Communication Architecture
– Mapping and scheduling of on-chip communication
• Communication Scheduling and Timing Analysis
– ILP Formulation
– Heuristic
• Consideration of
– Dynamic behavior of SW communication architecture
– Physical communication buffer sharing
• Future Work
– Extend the approach to more complicated on-chip networks
– Design an on-chip communication scheduler
Deep-Submicron Low-Power Design
Communication Architecture of MP-SoC Platform
Sungkyunkwan University
Jun-Dong Cho
Talk Outline
• MP-SoC Architectures:
– Homogeneous Architecture
– Heterogeneous Architecture
• Crossbar Interconnection Models
• MP-SoC Evaluation Methodology
Networks-on-Silicon, Philips
Albert van der Werf,
Philips Research
DSP based commercial SoC
MP-SoC Platforms?
• Platform: an architecture designed for an application domain
(responds to consumer demand and anticipates future change)
• Multiprocessor systems-on-chip
(usually heterogeneous multiprocessors):
– CPUs, DSPs, etc.
– Hardwired accelerators.
– Mixed-signal front end.
SoC topics …
• Task-level analysis and optimization
– Scheduling and resource sharing
– Mapping data flow to the target HW
– Distributed task splitting and merging
• Efficient communication optimization
– Bandwidth (BW) and memory allocation
– Synchronization
– Buffers
– Routing: shortest-path routing with minimal router logic
– Irregular meshes
– Mapping to topology: what topology will suit a particular application?
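As an illustration of shortest-path routing with very little router logic, the sketch below uses dimension-ordered (XY) routing on a regular 2D mesh. The slide does not name a routing scheme, so XY routing is an assumption chosen because each router only compares coordinates; the function name and the coordinate convention are hypothetical.

```python
def xy_route(src, dst):
    """Return the list of mesh nodes visited from src to dst, inclusive.

    Nodes are (x, y) tuples. The packet first travels along X until the
    column matches, then along Y, which always yields a minimal-length path
    on a full mesh.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hops = len(route) - 1  # equals the Manhattan distance between the endpoints
```

The scheme needs no routing tables, which is why it suits small routers. It does not apply directly to the irregular meshes mentioned above, where some links or nodes are missing.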
From computation-centric to communication-centric architectures
– System optimization: latency, throughput, bandwidth (BW).
– Interconnect consumes up to 50% of energy, and its share is growing.
– The interconnect must be scalable.
– It should also be optimized in terms of:
• Topology
• Link BW
• Place & route
• Routing protocol
4G: Multiple standards
Software Defined Radio & Multimedia
• Several components may be placed on a single die to reduce production cost. Ex) H.264 + MPEG-4
• Multi-DSP: WiBro, MoIP
• Maximum parallelism!
– Parallel data transfer among the components.
– Read, write, and calculate processes should be decoupled.
[Figure: DSP1, DSP2, …, DSPn, RAM, and TX attached to a network]
Why Multi-Threaded Cores?
STMicroelectronics MultiFlex MP-SoC
• Increasing gap between memory and processor speeds (2x / 2 years)
• More parallel processing (lower power, higher perf./mm2)
• Increasing gap between interconnect and gate delays (multi-clock intra-chip delay)
[Figure: MultiFlex block diagram with a RISC core (I$, D$), DSPs, H/W O/S schedulers, SRAM, and In/Out blocks connected by a NoC]
Exploitable Parallelism
[Figure: exploitable task parallelism versus minimum parallel grain size (instructions) for instruction-level parallelism, GP O/S thread-level parallelism, and MultiFlex thread-level parallelism; labels on the chart include grain sizes of 1, 100's, and 10 000's of instructions and parallelism ranges of 1~8, 2~6, and 1~100]
MPEG4 Codec Results
VGA video execution speed-up
[Figure: frames/sec (0 to 35) versus number of ARMs (2 to 6) for 2, 4, and 8 threads; curves: theoretical upper bound (latency = 0), MultiFlex (0-latency bus), and MultiFlex result (STBus)]
Intel IXP1200 Network
Processor
IXP1200 MicroEngine
Holistic design of multi-core architectures
• Naïve methodology is inefficient
• Demonstrated inefficiency for cores and proposed alternatives
– Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
– Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
– Conjoined-core Chip Multiprocessing [MICRO04]
• What about interconnects?
– How much can interconnects impact processor architecture?
– Do they need to be co-designed with caches and cores?
– Delivering clocks is problematic
– Wire delay is dominant
– Too many constraints, from too
many blocks
• Drew Wingard, CTO, Sonics, Inc.
• wingard@sonicsinc.com
Various Interconnect Topologies
• Linear (Pipeline)
• Star (Switch)
• Mesh (NoC)
• Tree (PCI-Express)
• Fully connected (When do we need it?)
• Ring (Cheap & Slow)
• Bus (Old style)
• Ring w/ bypass (Async)
• Hybrid (Dedicated)
State of the Art: Network on Chip
Networks are preferred over buses:
• Higher bandwidth
• Concurrency, effective spatial reuse of resources
• Higher levels of abstraction
• Modularity - Design Productivity Improvement
• Scalability
Network on Chip
[Figure: modules interconnected by an on-chip network]
The idea:
• Decouple communication and computation
• Simple routers instead of repeaters
• Route packets instead of wires
• Provide different services on the same infrastructure
Communication Centric Design Flow, Jeremy
[Figure: design flow. Application and Architecture Library → Architecture / Application Model → NoC Optimisation (Configure, Refine) → Evaluate (Analyse / Profile) → Good? If no, return to NoC Optimisation; otherwise Synthesis → Optimized NoC]
System Development Flow of QNoC
• Connect modules with an ideal network and measure traffic
• NoC optimisation:
– Place modules
– Map traffic to the grid using the given QNoC architecture
– Balance utilization, minimize cost, and meet QoS
• Refine QNoC; analyse / profile
• Good? If no, repeat the optimisation; otherwise proceed to synthesis
• Result: optimized NoC
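MPSoC design time application optimization and exploration
• Application-domain specification
• Application analysis, partitioning, transformation, and exploration
• Refined application-domain specification + Pareto curve generator
• Platform manager: resource allocation
• QoS satisfied? If no, refine
• Platform verification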
QNoC (Technion) vs. Alternative Solutions
Mesh (4x4): Uniform scenario (Same QoS):

  Arch.   Frequency   Utilization   Av. Link Width
  QNoC    1 GHz       30%           28
  Bus     50 MHz      50%           3700
  PTP     100 MHz     80%           6

[Figure: Wire-Length (Area) and Power. Relative cost of wire length and power for Bus, QNoC, and PTP]
NoC Design solution
• Explore various topology options to determine those that best meet the system objectives.
• Tune the arbitration scheme and network links (for example, burst lengths, communication FIFO sizes, etc.).
• NoCcompiler is used to create the NoC instance (packet format options, burst types, special transactions, maximum packet payload length, network port configurations, etc.).
• A cycle-accurate simulation model can then be generated for verification with SystemC or RTL simulation.
Danube NoC IP
• Support for OCP 2.0, AMBA AHB, and AMBA AXI socket interfaces
• Clock frequency up to 750 MHz in a 90 nm process
• GALS links for spanning distance and crossing clock boundaries
• Unlimited user-defined topology through Arteris NoCexplorer and NoCcompiler
• Flexible pipelining and FIFO management
• Customizable NoC Transaction and Transport Protocol (NTTP) packet format
• On-chip protocol for runtime SoC application debug
• Lightweight service bus for runtime debug, error management, and register configuration
SoftStream HERA 3000 (mediaexcel) (2)
• PCI board (6 DSPs visible)
Kompressor AVC (Ateme)
• The Kompressor Board is a multi-encoding short PCI board
• MPEG-4 AVC / H.264 live encode, IP streaming, and simultaneous file recording
• Uses TMS320C64x DSPs from Texas Instruments
• Uses a Cyclone FPGA from Altera
• Input: NTSC or PAL
• Video compression: MPEG-4 AVC / H.264 (ISO/IEC 14496-10) Baseline, Main, High Profiles
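Applications and Outlook
• Real-time video surveillance: IP cameras and streamers, network video recorders (NVR), etc.
• Meets the performance and integration requirements of the high-performance image-processing market (multifunction printers and scanners, film-processing equipment, image-recognition systems) as well as broadcast and IPTV (encoding, streaming, and transcoding equipment, video-conferencing systems, video phones)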
HiBRID-SoC Architecture
• HiBRID-SoC multi-core system-on-chip architecture
– Integrates a powerful on-chip communication structure
– A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
– Dedicated chips for the MPEG-4 Simple Profile; consists of a very general processing demand
• Three programmable cores
– Each adapted towards a specific class of algorithms
– Combination of the cores and their software development environment
– An extension of a programmable core with dedicated modules (e.g., TriMedia)
• HiBRID-SoC multi-core
– Developed at the University of Hannover
Morpho (MS1-16 v002)
http://www.morphotech.com/
• 16 16x16-bit MAC operations/cycle @ 500 MHz
• 4.8 GMAC/s @ 500 MHz
• 420 and 480 TBGA packaging (0.18u/0.13u)
• Core voltages at 1.8 V/1.2 V and 3.3 V
SandBridge(SB3000)
• Application areas
• MPEG capture and playback for video or
videoconferencing
• JPEG capture and playback for camera and
display functions
• MP3 capture and playback for music or
ringtone functions
• Other computation-intensive multimedia
functions (e.g. speech control)
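NEC and ARM
• Developing a new chip for phones and consumer electronics that places two or more processing cores and arithmetic units inside a single microprocessor.
• Mobile phone makers also have many ways to use it. For example, one core can handle call functions while another manages Internet traffic.
• Recorded performance comparable to a 1.2 GHz ARM11-family processor: up to 1440 DMIPS at roughly 600 mW, in a 130 nm process.
• Linux SMP OS port: reduced power consumption and automatic balancing of application load across processors.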
Efficient Shared DRAM Subsystems for SOCs, Sonics Inc.
• Increases SOC performance
– Improves efficiency of off-chip DRAM by up to 40%
– Guarantees Quality of Service for on-chip cores
• Lowers SOC costs
– Consolidates and reduces multiple distributed buffers
– Single Smart Interconnect replaces multiple layered buses
• Shortens time to market
– Smart Interconnect removes the wire-routing problem of classical architectures
– Accurate architectural exploration ensures functionality in the first days of development
• Increased market penetration
– DRAM technology selection decoupled from the rest of the SOC
– Threaded architecture enables easy scalability without redesign of the memory subsystem
Shared Bus and DRAM Subsystem
The traditional computer bus organization suffers from low DRAM efficiency and a lack of quality-of-service guarantees.
Star Topology Access to a Shared DRAM Subsystem
• The DRAM controller
1) sees all initiator requests at the same time and can select the order in which to service them;
2) can optimize the performance of the DRAM subsystem by reordering requests;
3) can provide flexible quality-of-service to each of the initiators;
4) causes a large number of wires to converge on the DRAM controller, producing physical problems for the design.
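To make the benefit of reordering concrete, here is a small sketch that counts DRAM row misses with and without reordering. It is a generic illustration in the spirit of "serve row hits first, otherwise the oldest request", not the Sonics algorithm, which the slides do not describe. The request format (bank, row) and the function name are assumptions, and all requests are assumed to be visible to the controller at once.

```python
def count_row_misses(requests, reorder):
    """Count row misses for a list of (bank, row) requests.

    reorder=False serves requests strictly in arrival order.
    reorder=True serves the oldest request that hits an already open row,
    falling back to the oldest request when nothing hits.
    """
    open_row = {}            # bank -> row currently open in that bank
    queue = list(requests)   # arrival order, oldest first
    misses = 0
    while queue:
        idx = 0
        if reorder:
            for i, (bank, row) in enumerate(queue):
                if open_row.get(bank) == row:
                    idx = i
                    break
        bank, row = queue.pop(idx)
        if open_row.get(bank) != row:
            misses += 1          # a precharge and activate would be needed here
            open_row[bank] = row
    return misses


# Two initiators ping-ponging between rows 1 and 2 of the same bank.
pattern = [(0, 1), (0, 2), (0, 1), (0, 2)]
in_order = count_row_misses(pattern, reorder=False)
reordered = count_row_misses(pattern, reorder=True)
```

A real scheduler would also bound how long a request may be bypassed, which is where the quality-of-service guarantees mentioned in item 3 come in; this sketch has no such bound.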
SiliconBackplane and DRAM Scheduler
The shared μNetwork remedies the wire congestion problem.
The DRAM scheduler addresses both the DRAM performance issues and the quality-of-service guarantees by selectively scheduling the DRAM accesses.
Unified DRAM and QoS
Scheduling
Set-top-box System
Bandwidth Profile of Bus with
Round-Robin Arbitration
Many of the initiators receive less than their
required bandwidth so overall application
requirements are unsatisfied.
Bandwidth Profile of Bus with
Priority Arbitration
From 5000 to 9000 cycles all but two initiators (CPU
and DSP) receive no service at all. Clearly, this is
unacceptable for the set-top-box application.
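The starvation effect of priority arbitration, and the contrast with round-robin, can be reproduced with a toy bus model. This is a sketch under simplifying assumptions that are not taken from the slides: the bus grants one request per cycle, each initiator issues one request every fixed number of cycles, and the initiator names and periods below are made up so that total demand exceeds the bus capacity.

```python
def simulate(periods, cycles, policy):
    """Return the number of grants each initiator receives.

    periods: {name: issue a request every `period` cycles}; dict order is
             the priority order (first entry has the highest priority).
    policy:  "priority" or "round_robin".
    """
    names = list(periods)
    pending = {n: 0 for n in names}
    grants = {n: 0 for n in names}
    ptr = 0  # round-robin pointer: index of the initiator to consider first
    for cycle in range(cycles):
        for n in names:
            if cycle % periods[n] == 0:
                pending[n] += 1
        if policy == "priority":
            order = names
        else:
            order = names[ptr:] + names[:ptr]
        for n in order:
            if pending[n]:
                pending[n] -= 1
                grants[n] += 1
                ptr = (names.index(n) + 1) % len(names)
                break
    return grants


demand = {"cpu": 2, "dsp": 2, "video": 4, "audio": 8}  # hypothetical periods
by_priority = simulate(demand, 80, "priority")
by_round_robin = simulate(demand, 80, "round_robin")
```

In this model the two highest-priority initiators together saturate the bus, so under fixed priority the others are never served, while round-robin serves everyone but gives each less than its demand. Neither policy allocates bandwidth according to need.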
Bandwidth Profile of Sonics Solution
Each of the initiators is connected to the DRAM through a Silicon Backplane μNetwork, a DRAM scheduler, and a DRAM controller.
The Silicon Backplane and DRAM bandwidth have been allocated to the different initiators according to their needs.
All application requirements are met, and overall DRAM utilization stays steady at around 70%.
References
• Terry Tao Ye, On-Chip Multiprocessor Communication Network Design and Analysis, Ph.D. Dissertation, Stanford Univ.
• E. Bolotin, et al., Automatic Hardware-Efficient SoC Integration by QoS Network on Chip, Technion - Israel Institute of Technology, Haifa, Israel.
• E. Bolotin, et al., Efficient Routing in Irregular Topology NoCs, Technion - Israel Institute of Technology.
The Stanford Hydra CMP
• Lance Hammond
• Benedict A. Hubbert
• Michael Siu
• Manohar K. Prabhu
• Michael Chen
• Kunle Olukotun
Presented by Jason Davis
Introduction
• Hydra CMP with 4 MIPS processors
• L1 cache for each CPU and an L2 cache that holds the permanent state
• Why?
– Moore's law is reaching its end
– Finite amount of ILP
– TLP (Thread-Level Parallelism) vs. ILP in a pipelined architecture
– CMP can use ILP as well (TLP and ILP are orthogonal)
– Wire delay
– Design time (the CPU core doesn't need to be redesigned; just increase the number of cores)
• Problems
– Integration densities are only now giving reasons to consider new models
– Difficult to convert uniprocessor code
– Multiprogramming is hard
Base Design
• 4 MIPS cores (250 MHz)
– Each core:
• L1 data cache
• L1 primary instruction cache
– Share a single L2 cache
• Virtual buses (pipelined with repeaters)
– Read bus (256 bits)
• Acts as a general-purpose system bus for moving data between CPUs, L2, and external memory
• Wide enough to handle an entire cache line (an explicit CMP gain; multiprocessor systems would require too many pins)
– Write bus (64 bits)
• Writes directly from the 4 CPUs to L2
• Pipelined to allow single-cycle occupancy (not a bottleneck)
• Uses simple invalidation for caches (a broadcast invalidates all other L1s)
• L2 cache
– Point of communication (10-20 cycles)
• Buses are sufficient for 4-8 MIPS cores; more cores need larger system buses
Base Design
Parallel Software Performance
Thread Speculation
• Takes the instruction sequence of a normal program and arbitrarily breaks it into a sequenced group of threads
– Hardware must track all inter-thread dependencies to ensure the program behaves the same way
– Must re-execute code that follows a data violation based upon a true dependency
• Advantages:
– Does not require synchronization (unlike enforcing dependencies on multiprocessor systems)
– Dynamic (done at run time), so the programmer needs to consider it only to obtain maximum performance
– Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted wherever dependencies can happen, not just where they do happen
• 5 issues to address:
Thread Speculation
1. Forward data between
parallel threads
2. Detect when reads occur
too early (RAW)
3. Safely Discard
speculative state after
violations
Thread Speculation
4. Retire speculative writes
in correct order (WAW
hazard)
5. Provide Memory
renaming (WAR
hazards)
Hydra Speculation Implementation
• Takes care of the 5 issues:
– Forward data between parallel threads:
• When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated
• On a miss in L1, the access goes to L2; data from the write buffers of the current or older threads replaces the data returned from L2, byte by byte
– Detect when a read occurs too early:
• Primary cache bits are set to mark possible violations; if a write to that address by an earlier thread invalidates the line, a violation is detected and the thread is restarted
– Safely discard speculative state after a violation:
• Permanent state is kept in L2; any L1 lines that hold speculative data are invalidated, and the L2 buffer for the thread is discarded (permanent state is not affected)
Hydra Speculation Implementation
– Place speculative writes in memory in correct order:
• Separate speculative-data L2 buffers are kept for each thread
• They must be drained into L2 in the original sequence
• The thread sequencing system also sequences the buffer draining
– Memory renaming:
• Each CPU can only read data written by itself or earlier threads
• Writes from later threads don't cause immediate invalidations (since writes from these threads should not be visible yet)
• Ignored invalidations are recorded with a pre-invalidate bit
• If a thread accesses L2, it must only access data it should be able to see, from its own or earlier L2 buffers
• When the current thread completes, all currently pre-invalidated lines are checked against future threads for violations
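The five mechanisms can be mimicked in software with one write buffer and one read set per speculative thread. The sketch below is a simplified model for illustration only, not the Hydra hardware: it works on whole addresses rather than cache lines and bytes, flags violations conservatively, and uses hypothetical class and method names. Thread index 0 is the oldest (least speculative) thread.

```python
class SpeculativeMemory:
    """Toy model of speculative threads sharing a permanent L2 state."""

    def __init__(self, l2, num_threads):
        self.l2 = dict(l2)                                    # permanent state
        self.buffers = [dict() for _ in range(num_threads)]   # speculative writes
        self.read_sets = [set() for _ in range(num_threads)]  # addresses read from outside
        self.violated = set()                                 # threads needing a restart

    def read(self, tid, addr):
        # Renaming: a thread sees only its own writes and those of earlier threads.
        if addr in self.buffers[tid]:
            return self.buffers[tid][addr]
        self.read_sets[tid].add(addr)
        for t in range(tid - 1, -1, -1):   # forwarding from the nearest earlier thread
            if addr in self.buffers[t]:
                return self.buffers[t][addr]
        return self.l2.get(addr)

    def write(self, tid, addr, value):
        self.buffers[tid][addr] = value
        # RAW check: a later thread that already read this address read too early.
        for later in range(tid + 1, len(self.buffers)):
            if addr in self.read_sets[later]:
                self.violated.add(later)

    def restart(self, tid):
        # Discard speculative state; the permanent state in l2 is untouched.
        self.buffers[tid].clear()
        self.read_sets[tid].clear()
        self.violated.discard(tid)

    def commit_all(self):
        # Drain buffers oldest first so later writes win (WAW order).
        for buf, reads in zip(self.buffers, self.read_sets):
            self.l2.update(buf)
            buf.clear()
            reads.clear()


mem = SpeculativeMemory({"x": 0}, num_threads=3)
early = mem.read(2, "x")       # thread 2 reads before earlier threads have written
mem.write(0, "x", 5)           # exposes the RAW violation in thread 2
mem.write(1, "x", 7)
needs_restart = set(mem.violated)
mem.restart(2)
late = mem.read(2, "x")        # now forwarded from thread 1's buffer
mem.commit_all()
```

Keeping speculative writes out of `l2` until commit is what makes the restart cheap: discarding a thread only clears its own buffer and read set.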
Hydra Speculation Implementation
Hydra Speculation Implementation
Speculation Performance
Prototype
• MIPS-based RC32364
• SRAM macro cells
• 8-Kbyte L1 data and instruction caches
• 128-Kbyte L2
• Die is 90 mm^2 in a 0.25-micron process
• Have a Verilog model; moving to physical design using synthesis
• Central arbitration for the buses will be the most difficult part: it is hard to pipeline, must accept many requests, and must reply with grant signals
Prototype
Prototype
Conclusion
• Hydra CMP
– High performance
– Cost-effective alternative to large single-processor chips
– With similar die area, can achieve performance similar to a uniprocessor on integer programs using thread speculation
– Multiprogrammed or highly parallel workloads can do better than a single processor
– Hardware thread speculation is not cost-intensive and can give great performance gains