SCHEDULING AND TIMING ANALYSIS OF HW/SW ON-CHIP COMMUNICATION IN MP SOC DESIGN
Sungkyunkwan University, School of Information and Communication Engineering
© Jun-Dong Cho (조준동), Fall 2006

Contents
– Introduction
– Motivation
– Communication Scheduling and HW/SW Timing Analysis
– Experimental Results
– Conclusion
© Jun-Dong Cho (조준동), Summer 2006

Introduction
• On-chip communication architecture in MP SoC design
• On-chip communication design
  – Design of the HW/SW communication architecture
  – Mapping and scheduling of on-chip communication
• The contribution of this work is to consider both
  – The dynamic behavior of the SW communication architecture
  – Physical communication buffer sharing

Preliminaries
• Mapping/Allocation
• Communication Scheduling

Motivation

Extended Task Graph

Communication Delay Model
• Communication delay: $C_i(n) = C_i^{SW}(n) + C_i^{HW}(n) + C_i^{OCN}(n)$
  – Communication delay of the SW communication architecture:
    $C_i^{SW}(n) = R(v_i)\,\bigl(C_{ISR}(n) + C_{CS}(n)\bigr) + C_{DD}(n)$
  – Communication delay of the HW communication interface:
    $C_i^{HW}(n) = n \times (\text{delay of HW communication interface})$
  – Communication delay of the on-chip communication network:
    $C_i^{OCN}(n) = D_{trans} \cdot n + D_{BLOCKED}$

Communication Scheduling and Timing Analysis
• ILP for scheduling the communication nodes and tasks of the ETG
  – Data dependency constraints: $t_1^{end} \le t_2^{start}$ and $t_1^{end} \le t_3^{start}$
  – Resource contention constraints
  – Processor and on-chip communication network

ILP with Binary Variables
  – Physical communication buffer contention constraints

Heuristic Algorithm
• List scheduling

    LIST(G(V, E), a) {
        l = 1;
        repeat {
            for each resource type k = 1, 2, …, n_res {
                Determine candidate tasks U_{l,k};
                Determine unfinished tasks T_{l,k};
                Select S_k ⊆ U_{l,k} nodes, such that |S_k| + |T_{l,k}| ≤ a_k;
                Schedule the S_k tasks at time l by setting t_i = l for all i : v_i ∈ S_k;
            }
            l = l + 1;
        } until (v_n is scheduled);
        return (t);
    }

Experiments

Experimental Results
• Execution delay of tasks in the H.263 system

    Functional block     Execution cycles
    Source                           1386
    Motion predictor              1123650
    Macro block                    875358
    Encoder                        875358
    VLC                           2103156

• Delay of the software communication architecture
  – Software communication architecture delay measured by ISS.

    Service              Execution cycles
    ISR                               857
    CS                               1060
    DD (read)                        5376
    DD (write)                       6528

Execution Time of H.263 encoder

Execution Time of JPEG and IS-95 example

Conclusion
• On-chip communication design
  – Design of the HW/SW communication architecture
  – Mapping and scheduling of on-chip communication
• Communication scheduling and timing analysis
  – ILP formulation
  – Heuristic
• Consideration of
  – The dynamic behavior of the SW communication architecture
  – Physical communication buffer sharing
• Future work
  – Extend the approach to more complicated on-chip networks
  – Design an on-chip communication scheduler

Deep-submicron Low-Power Design
Sungkyunkwan University, School of Information and Communication Engineering

Communication Architecture of MP-SoC Platform
Jun-Dong Cho (조준동), Sungkyunkwan University, School of Information and Communication Engineering

Talk Outline
• MP-SoC architectures:
  – Homogeneous architecture
  – Heterogeneous architecture
• Crossbar interconnection models
• MP-SoC evaluation methodology

Networks-on-Silicon, Philips
Albert van der Werf, Philips Research

DSP based commercial SoC

MP-SOC Platforms?
• Platform: an architecture that is designed for an application domain (responding to consumer demand and anticipating future changes)
• Multiprocessor systems-on-chips (usually heterogeneous multiprocessors):
  – CPUs, DSPs, etc.
  – Hardwired accelerators.
  – Mixed-signal front end.

SoC topics …
• Task-level analysis and optimizations
  – Scheduling and resource sharing
  – Mapping data flow to the target HW
  – Distributed task splitting and merging
• Efficient communication optimization
  – BW and memory allocation
  – Synchronization
  – Buffers
  – Routing: shortest-path routing with minimal router logic
  – Irregular meshes
  – Mapping to topology: which topology suits a particular application?

From computation centric to communication centric architectures
– System optimization: latency, throughput, BW.
– Interconnect consumes up to 50% of energy, and growing…
– Scalable
– Should also be optimized
  • Topology
  • Link BW
  • Place & route
  • Routing protocol

4G: Multiple standards, Software Defined Radio & Multimedia
• A number of components might be placed on a single die in order to decrease the production cost. Ex) H.264 + MPEG-4
• Multi-DSP: WiBro, MoIP
• Maximum parallelism!
  – Parallel data transfer among the components.
  – Read, write, and calculate processes should be decoupled.
[Figure: DSP1, DSP2, …, DSPn and RAM connected through a TX network]

Why Multi-Threaded Cores? STMicroelectronics MultiFlex MP-SoC
• Increasing gap: memory & processor speeds (2x / 2 years)
• More parallel processing (lower power, higher perf./mm2)
• Increasing gap: interconnect & gate delays (multi-clock intra-chip delay)
[Figure: MultiFlex block diagram with RISC (D$, I$), H/W O/S schedulers, DSPs (I$), NoC, In/Out, SRAM]

Exploitable Parallelism
[Figure: exploitable task parallelism versus minimum parallel grain size (instructions), comparing instruction-level parallelism, MultiFlex thread-level parallelism, and GP O/S thread-level parallelism]

MPEG4 VGA Video Codec: Execution Speed-up Results
[Figure: frames/sec versus number of ARMs for 2, 4, and 8 threads; curves: theoretical upper bound, MultiFlex (0 latency bus), MultiFlex result (STBus)]

Intel IXP1200 Network Processor

IXP1200 MicroEngine

Holistic design of multi-core architectures
• Naïve methodology is inefficient
• Demonstrated inefficiency for cores and proposed alternatives
  – Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
  – Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
  – Conjoined-core Chip Multiprocessing [MICRO04]
• What about interconnects?
  – How much can interconnects impact processor architecture?
  – Do they need to be co-designed with caches and cores?

– Delivering clocks is problematic
– Wire delay is dominant
– Too many constraints, from too many blocks
• Drew Wingard, CTO, Sonics, Inc.
• wingard@sonicsinc.com

Various Interconnect Topologies
• Linear (pipeline)
• Star (switch)
• Mesh (NoC)
• Tree (PCI-Express)
• Fully connected (when do we need it?)
• Ring (cheap & slow)
• Bus (old style)
• Ring w/ bypass
• Hybrid (Dedicated) (Async)

State of the Art: Network on Chip
Networks are preferred over buses:
• Higher bandwidth
• Concurrency, effective spatial reuse of resources
• Higher levels of abstraction
• Modularity: design productivity improvement
• Scalability

Network on Chip
[Figure: grid of modules connected by an on-chip network]
The idea:
• Decouple communication and computation
• Simple routers instead of repeaters
• Route packets instead of wires
• Provide different services on the same infrastructure

Communication Centric Design Flow, Jeremy
[Figure: design flow. Application and architecture library → architecture/application model → NoC optimisation (configure, refine, evaluate, analyse/profile) → Good? If no, iterate; otherwise synthesis → optimized NoC]

System Development Flow of QNoC
• Connect modules with an ideal network and measure traffic
• NoC optimisation
  – Place modules
  – Map traffic to the grid using the given QNoC architecture
  – Balance utilization, minimize cost, and meet QoS
  – Refine QNoC, analyse/profile
• Good? If no, iterate; otherwise synthesis → optimized NoC

MPSoC design time application optimization and exploration
[Figure: flow with the blocks: application specification; application analysis, partitioning, transformation, and exploration; improved application specification + Pareto curves; generator refinement; platform manager; QoS satisfied? (No: iterate); resource allocation; platform verification]

QNoC (Technion) vs. Alternative Solutions
Mesh (4x4), uniform scenario (same QoS):

    Arch.    Frequency    Utilization    Av. link width
    QNoC     1 GHz        30%            28
    Bus      50 MHz       50%            3700
    PTP      100 MHz      80%            6

[Figure: wire length (area) and power cost, log scale, for Bus, QNoC, and PTP]

NoC Design solution
• Explore various topology options to determine those that best meet the system objectives.
• Tune the arbitration scheme and network links (for example, burst lengths, communication FIFO sizes, etc.).
• NoCcompiler is used to create the NoC instance (packet format options, burst types, special transactions, maximum packet payload length, network port configurations, etc.).
• A cycle-accurate simulation model can then be generated for verification with SystemC or RTL simulation.

Danube NoC IP
• Support for OCP 2.0, AMBA AHB, and AMBA AXI socket interfaces
• Clock frequency up to 750 MHz in a 90 nm process
• GALS links for spanning distance and crossing clock boundaries
• Unlimited user-defined topology through Arteris NoCexplorer and NoCcompiler
• Flexible pipelining and FIFO management
• Customizable NoC Transaction and Transport Protocol (NTTP) packet format
• On-chip protocol for runtime SoC application debug
• Lightweight service bus for runtime debug, error management, and register configuration

SoftStream HERA 3000 (mediaexcel) (2)
PCI board (6 DSPs visible)

Kompressor AVC (Ateme)
• The Kompressor board is a multi-encoding short PCI board
• MPEG-4 AVC / H.264 live encode, IP streaming, and simultaneous file recording
• TMS320C64x DSPs from Texas Instruments
• FPGA: Cyclone from Altera
• Input: NTSC or PAL video
• Compression: MPEG-4 AVC / H.264 (ISO/IEC 14496-10), Baseline, Main, and High profiles

Applications and Outlook
• Real-time video surveillance: IP cameras and streamers, network video recorders (NVR), etc.
• Meets the performance and integration requirements of the high-performance image processing market (multifunction printers and scanners, film processing equipment, image recognition systems), as well as broadcast and IP TV (encoding, streaming, and transcoding equipment, video conferencing systems, video phones)

HiBRID-SoC Architecture
• HiBRID-SoC: a multi-core system-on-chip architecture
• Integrates a powerful on-chip communication structure
• A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
• Dedicated chips for the MPEG-4 Simple Profile; consists of a very general processing demand
• Three programmable cores
  – Each adapted towards a specific class of algorithms
  – Combination of the cores and their software development environment
• An extension of a programmable core with dedicated modules (e.g., Trimedia)
• HiBRID-SoC multi-core developed at the University of Hannover

Morpho (MS1-16 v002)
http://www.morphotech.com/
• 16 16x16-bit MAC operations/cycle @500MHz
• 4.8 GMAC/s @500MHz
• 420 and 480 TBGA packaging (0.18u/0.13u)
• Core voltages at 1.8V/1.2V and 3.3V

SandBridge (SB3000)
• Application areas
  – MPEG capture and playback for video or videoconferencing
  – JPEG capture and playback for camera and display functions
  – MP3 capture and playback for music or ringtone functions
  – Other computation-intensive multimedia functions (e.g. speech control)

NEC and ARM
• Developing a new chip for phones and consumer electronics that places two or more processing cores and arithmetic units inside a single microprocessor.
• Mobile phone manufacturers can use it in various ways. For example, one core can handle the call function while another manages Internet traffic.
• Recorded performance on par with the 1.2 GHz ARM11 processor family: up to 1440 DMIPS at about 600 mW power consumption, in a 130 nm process.
• Linux SMP OS port: reduced power consumption and automatic balancing of the application load across processors.

Efficient Shared DRAM Subsystems for SOCs, Sonics Inc.
• Increases SOC performance
  – Improves efficiency of off-chip DRAM by up to 40%
  – Guarantees quality of service for on-chip cores
• Lowers SOC costs
  – Consolidates and reduces multiple distributed buffers
  – A single Smart Interconnect replaces multiple layered busses
• Shortens time to market
  – Smart Interconnect removes the wire routing problem of classical architectures
  – Accurate architectural exploration ensures functionality in the first days of development
• Increases market penetration
  – DRAM technology selection is decoupled from the rest of the SOC
  – Threaded architecture enables easy scalability without re-design of the memory subsystem

Shared Bus and DRAM Subsystem
The traditional computer bus organization suffers from low DRAM efficiency and a lack of quality-of-service.

Star Topology Access to a Shared DRAM Subsystem
1) The DRAM controller sees all initiator requests at the same time and can select the order of servicing them.
2) It can optimize the performance of the DRAM subsystem by reordering requests.
3) It can provide flexible quality-of-service to each of the initiators.
4) The star topology causes a large number of wires to converge on the DRAM controller, producing physical problems for the design.
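
The reordering in points 1) to 3) can be illustrated with a small sketch. This is not the Sonics scheduler: it is a minimal row-hit-first policy written only for illustration. The `Request` fields, the notion of a single open DRAM `row`, and the `max_bypass` fairness bound are all assumptions introduced here, not details from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    initiator: str  # hypothetical initiator name, e.g. "CPU"
    row: int        # DRAM row targeted by the request (assumed model)


def reorder(requests, max_bypass=2):
    """Return a service order that prefers requests hitting the open row.

    The oldest pending request may be bypassed at most `max_bypass` times
    in a row, which is a crude stand-in for a quality-of-service bound.
    """
    pending = list(requests)  # kept in arrival order
    order = []
    open_row = None           # row left open by the last serviced request
    bypassed = 0              # consecutive times the oldest request was skipped
    while pending:
        oldest = pending[0]
        hit = next((r for r in pending if r.row == open_row), None)
        if hit is not None and hit is not oldest and bypassed < max_bypass:
            chosen = hit      # row hit: avoids closing and reopening a row
            bypassed += 1
        else:
            chosen = oldest   # no usable hit, or fairness bound reached
            bypassed = 0
        pending.remove(chosen)
        order.append(chosen)
        open_row = chosen.row
    return order


def row_switches(order):
    """Count how often consecutive requests target different rows."""
    return sum(1 for a, b in zip(order, order[1:]) if a.row != b.row)
```

With `max_bypass=0` the function degenerates to plain arrival-order service, so the same code can be used to compare the two policies on a given request trace by calling `row_switches` on each result.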

SiliconBackplane and DRAM Scheduler
• The shared μNetwork remedies the wire congestion problem.
• The DRAM scheduler addresses both the DRAM performance issues and the quality-of-service guarantees by selectively scheduling the DRAM accesses.

Unified DRAM and QoS Scheduling

Set-top-box System

Bandwidth Profile of Bus with Round-Robin Arbitration
Many of the initiators receive less than their required bandwidth, so the overall application requirements are not satisfied.

Bandwidth Profile of Bus with Priority Arbitration
From 5000 to 9000 cycles, all but two initiators (CPU and DSP) receive no service at all. This is unacceptable for the set-top-box application.

Bandwidth Profile of Sonics Solution
Each initiator is connected to the DRAM through a Silicon Backplane μNetwork, a DRAM scheduler, and a DRAM controller. The Silicon Backplane and DRAM bandwidth have been allocated to the different initiators according to their needs. All application requirements are met, and overall DRAM utilization is steady at around 70%.

References
• Terry Tao Ye, On-Chip Multiprocessor Communication Network Design and Analysis, Ph.D. Dissertation, Stanford Univ.
• E. Bolotin, et al., Automatic Hardware-Efficient SoC Integration by QoS Network on Chip, Technion, Israel Institute of Tech., Haifa, Israel.
• E. Bolotin, et al., Efficient Routing in Irregular Topology NoCs, Technion, Israel Institute of Tech.

The Stanford Hydra CMP
Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, Kunle Olukotun
Presented by Jason Davis

Introduction
• Hydra CMP with 4 MIPS processors
• L1 cache for each CPU and an L2 cache that holds the permanent state
• Why?
  – Moore's law is reaching its end
  – Finite amount of ILP
  – TLP (thread-level parallelism) vs. ILP in pipelined architectures
  – A CMP can use ILP as well (TLP and ILP are orthogonal)
  – Wire delay
  – Design time (the CPU core doesn't need to be redesigned; just increase the number of cores)
• Problems
  – Integration densities are only now giving reasons to consider new models
  – Difficult to convert uniprocessor code
  – Multiprogramming is hard

Base Design
• 4 MIPS cores (250 MHz)
  – Each core:
    • L1 data cache
    • L1 primary instruction cache
  – Share a single L2 cache
• Virtual buses (pipelined with repeaters)
  – Read bus (256 bits)
    • Acts as a general-purpose system bus for moving data between CPUs, L2, and external memory
    • Wide enough to handle an entire cache line (an explicit gain of a CMP; multiprocessor systems would require too many pins)
  – Write bus (64 bits)
    • Writes directly from the 4 CPUs to L2
    • Pipelined to allow for single-cycle occupancy (not a bottleneck)
    • Uses simple invalidation for caches (a broadcast invalidates all other L1s)
• L2 cache
  – Point of communication (10-20 cycles)
• Buses are sufficient for 4-8 MIPS cores; more cores need larger system buses

Base Design

Parallel Software Performance

Thread Speculation
• Takes the instruction sequence of a normal program and arbitrarily breaks it into a sequenced group of threads
  – Hardware must track all inter-thread dependencies to ensure the program acts the same way
  – Must re-execute code that follows a data violation based upon a true dependency
• Advantages:
  – Does not require synchronization (different from enforcing dependencies on multiprocessor systems)
  – Dynamic (done at runtime), so the programmer only needs to consider it for maximum performance
  – Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted where dependencies can happen, not just where they do happen
• 5 issues to address:
  1. Forward data between parallel threads
  2. Detect when reads occur too early (RAW)
  3. Safely discard speculative state after violations
  4. Retire speculative writes in the correct order (WAW hazards)
  5. Provide memory renaming (WAR hazards)

Hydra Speculation Implementation
• Takes care of the 5 issues:
  – Forward data between parallel threads:
    • When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated
    • On a miss in L1, access L2; data in the write buffers of the current or older threads replaces the data returned from L2, byte by byte
  – Detect when a read occurs too early:
    • Primary cache bits are set to mark possible violations; if a write to that address by an earlier thread invalidates the line, a violation is detected and the thread is restarted
  – Safely discard speculative state after a violation:
    • Permanent state is kept in L2; any L1 lines holding speculative data are invalidated, and the L2 buffer for the thread is discarded (permanent state is not affected)
  – Place speculative writes in memory in the correct order:
    • Separate speculative-data L2 buffers are kept for each thread
    • Buffers must be drained into L2 in the original sequence
    • The thread sequencing system also sequences the buffer draining
  – Memory renaming:
    • Each CPU can only read data written by itself or by earlier threads
    • Writes from later threads don't cause immediate invalidations (since writes from these threads should not be visible yet)
    • Ignored invalidations are recorded with a pre-invalidate bit
    • If a thread accesses L2, it must only access data it should be able to see, from its own or earlier threads' L2 buffers
    • When the current thread completes, all currently pre-invalidated lines are checked against future threads for violations

Hydra Speculation Implementation

Hydra Speculation Implementation

Speculation Performance

Prototype
• MIPS-based RC32364
• SRAM macro cells
• 8-Kbyte L1 data and instruction caches
• 128 Kbytes L2
• Die is 90 mm^2, 0.25-micron process
• Have a Verilog model, moving to physical design using synthesis
• Central arbitration for the buses will be the most difficult part: it is hard to pipeline, must accept many requests, and must reply with grant signals

Prototype

Prototype

Conclusion
• Hydra CMP
  – High performance
  – Cost-effective alternative to large single-processor chips
  – With a similar die area, can achieve performance similar to a uniprocessor on integer programs using thread speculation
  – Multiprogrammed or highly parallel workloads can do better than a single processor
  – Hardware thread speculation is not cost intensive and can give great gains in performance
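
As an illustration of the read-too-early (RAW) check, data forwarding, and in-order buffer draining described in the Hydra Speculation Implementation slides, the following is a software toy model, not Hydra's hardware. The names `SpecThread`, `SpecSystem`, `load`, `store`, and `commit_all` are hypothetical, a dictionary stands in for the L2 permanent state, and squashing every thread newer than the violating one is a simplifying assumption of this sketch.

```python
class SpecThread:
    """One speculative thread, identified by its position in program order."""

    def __init__(self, seq):
        self.seq = seq
        self.write_buf = {}    # speculative writes, not yet in permanent state
        self.read_set = set()  # addresses read without a prior own write
        self.restarts = 0

    def squash(self):
        # Discard all speculative state; permanent state is untouched.
        self.write_buf.clear()
        self.read_set.clear()
        self.restarts += 1


class SpecSystem:
    def __init__(self, n_threads, memory):
        self.memory = dict(memory)  # stands in for the L2 permanent state
        self.threads = [SpecThread(i) for i in range(n_threads)]

    def load(self, seq, addr):
        t = self.threads[seq]
        if addr in t.write_buf:          # own write: no dependency on others
            return t.write_buf[addr]
        t.read_set.add(addr)             # mark a possible future violation
        for older in reversed(self.threads[:seq]):
            if addr in older.write_buf:  # forward from nearest older writer
                return older.write_buf[addr]
        return self.memory[addr]

    def store(self, seq, addr, value):
        self.threads[seq].write_buf[addr] = value
        readers = [t.seq for t in self.threads[seq + 1:] if addr in t.read_set]
        if not readers:
            return []
        # A newer thread read this address too early: restart it and,
        # in this sketch, every thread after it as well.
        squashed = self.threads[min(readers):]
        for t in squashed:
            t.squash()
        return [t.seq for t in squashed]

    def commit_all(self):
        # Drain write buffers into permanent state in original thread order.
        for t in self.threads:
            self.memory.update(t.write_buf)
            t.write_buf.clear()
            t.read_set.clear()
```

A typical use is to create `SpecSystem(3, {"x": 0})`, let thread 2 call `load(2, "x")` early, and then have thread 0 call `store(0, "x", 5)`; the returned list names the threads that were squashed and must re-execute.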