Dukki Hong¹  Sang-Oak Woo³  Youngduke Seo¹  Youngsik Kim²  Kwon-Taek Kwon³  Seok-Yoon Jung³  Kyoungwoo Lee⁴  Woo-Chan Park¹
¹Media Processor Lab., Sejong University   ²Korea Polytechnic University   ³SAIT of Samsung Electronics Co., Ltd.   ⁴Yonsei University
dkhong@rayman.sejong.ac.kr   http://rayman.sejong.ac.kr
October 3, 2013

Introduction
Related Work
 ◦ Texture mapping
 ◦ Non-blocking scheme
Proposed Non-Blocking Texture Cache
 ◦ The proposed architecture
 ◦ Buffers for the non-blocking scheme
 ◦ Execution flow of the NBTC
Experimental Results
Conclusion

Texture mapping
 ◦ A core technique for 3D graphics
 ◦ Maps texture images onto object surfaces
Problem: a huge amount of memory access is required
 ◦ A major bottleneck in graphics pipelines
 ◦ Modern GPUs generally use texture caches to alleviate this problem
Improving texture cache performance
 ◦ Improving cache hit rates
 ◦ Reducing the miss penalty
 ◦ Reducing cache access time

The visual quality of mobile 3D games has evolved enough to compare with PC games.
 ◦ Detailed texture images, e.g., Infinity Blade: 2048 [GDC 2011]
 ◦ Demanding high texture-mapping throughput
<Epic Games: Infinity Blade series>   <Gameloft: Asphalt series>

Improving texture cache performance
 ◦ Improving cache hit rates
 ◦ Reducing the miss penalty
 ◦ Reducing cache access time
"Our approach": in this presentation, we introduce a non-blocking texture cache (NBTC) architecture with
 ◦ out-of-order (OOO) execution, and
 ◦ conditional in-order (IO) completion for texture requests with the same screen coordinate, to support the standard API effectively

Texture mapping
Texture mapping glues n-D images onto geometric objects
 ◦ to increase realism
<Texture>   <Object>   <Texture-mapped object>

Texture filtering
Texture filtering is an operation that reduces the aliasing artifacts caused by texture mapping
 ◦ Bi-linear filtering: four samples per texture access
 ◦ Tri-linear filtering: eight samples per texture access
<Results of the texture filtering>

Cache performance study
 ◦ In [Hakura and Gupta 1997], the performance of a texture cache was measured on various benchmarks
 ◦ In [Igehy et al. 1999], the performance of a texture cache was studied with multiple pixel pipelines
Pre-fetching scheme
 ◦ In [Igehy et al.
1998], the latency generated by texture cache misses can be hidden by applying an explicit pre-fetching scheme
Survey of texture caches
 ◦ The introduction of texture caches and the integration of texture cache architectures into modern GPUs were studied in [Doggett 2012]

Non-blocking cache (NBC)
 ◦ Allows subsequent cache requests while a cache miss is being handled, reducing miss-induced processor stalls
 ◦ Kroft first published an NBC using missing information/status holding registers (MSHRs), which keep track of multiple outstanding misses [Kroft 1981]
<Blocking Cache vs. Non-blocking Cache with MSHR: a blocking cache stalls the CPU for the full miss penalty on every miss; an NBC stalls only when the result is needed>
<Kroft's MSHR: a block valid bit, the block request address, and a comparator, plus per-word valid bit, destination, and format fields for words 0 … n>

Performance study of non-blocking caches
 ◦ Comparison of four different MSHR organizations [Farkas and Jouppi 1994]
   - Implicitly addressed MSHR: Kroft's MSHR
   - Explicitly addressed MSHR: the complement of the implicit version
   - In-cache MSHR: each cache line serves as an MSHR
   - The first three allow only one entry per missed block address
   - Inverted MSHR: a single entry per possible destination; the number of entries equals the number of usable registers (possible destinations) in the processor
<Inverted MSHR organization: per-register valid bit, request address, format, and address-in-block fields with comparators, and a match encoder that produces the matching register number>
 ◦ For a recent high-performance out-of-order (OOO) processor on the latest SPEC benchmark [Libit et al. 2011], a non-blocking data cache with hit-under-two-misses improved the OOO processor's performance by 17.76% over a blocking data cache

Proposed Non-Blocking Texture Cache

The proposed architecture
<Proposed NBTC architecture: shading unit → texture address generation → L1 cache with a hit/miss router; retry buffer, waiting list buffer, and block address buffer; request address queue to DRAM or an L2 cache; MUX into the texture mapping pipeline>
This architecture includes a typical blocking texture cache (BTC) with a level-1 (L1) cache, as well as three kinds of buffers for the non-blocking scheme:
 ◦ Retry buffer
   - Guarantees IO completion
 ◦ Waiting list buffer
   - Keeps track of miss information
 ◦ Block address buffer
   - Removes duplicate block addresses
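The three buffer formats above can be summarized in a minimal Python sketch. The field names follow the slides; the class names, types, and the eight-texel width (the tri-linear case) are illustrative assumptions, not the hardware encoding.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RetryBufferEntry:
    """One retry buffer (RB) slot; the RB is a FIFO drained in input order."""
    valid: bool = False                 # 0 = empty, 1 = occupied
    screen_coord: tuple = (0, 0)        # (x, y) for the output display unit
    texture_request: Optional[object] = None   # filtering info + texture address
    ready: bool = False                 # 1 = filtered texture data is valid
    filtered_data: Optional[object] = None

@dataclass
class WaitingListEntry:
    """One waiting list buffer (WLB) slot; similar to an inverted MSHR entry."""
    valid: bool = False
    texture_id: int = 0
    filtering_info: Optional[object] = None
    texel_addrs: List[int] = field(default_factory=lambda: [0] * 8)
    texel_data: List[Optional[object]] = field(default_factory=lambda: [None] * 8)
    ready: List[bool] = field(default_factory=lambda: [False] * 8)

class BlockAddressBuffer:
    """FIFO of missed block addresses; duplicate DRAM requests are removed."""
    def __init__(self):
        self.fifo = deque()

    def push(self, block_addr: int) -> bool:
        if block_addr in self.fifo:     # duplicate: block already requested
            return False
        self.fifo.append(block_addr)
        return True

    def pop(self) -> int:               # next block address sent to DRAM/L2
        return self.fifo.popleft()
```

For example, pushing the same missed block address twice issues only one DRAM request, which is why the bandwidth overhead measured later stays small.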
Retry Buffer
<RB entry: Valid Bit | Screen Coordinate | Texture Request (filtering information, texture address) | Ready Bit | Filtered Texture Data>
Feature
 ◦ The most important property of the retry buffer (RB) is its support of IO completion
   - The RB stores fragment information in input order
   - The RB is designed as a FIFO
Data format of each RB entry
 ◦ Valid bit: 0 = empty, 1 = occupied
 ◦ Screen coordinate: the screen coordinate (x, y) for the output display unit
 ◦ Texture request: filtering information and texture address
 ◦ Ready bit: 0 = invalid filtered texture data, 1 = valid filtered texture data
 ◦ Filtered texture data: the texture data of the accomplished texture mapping

Waiting List Buffer
<WLB entry: Valid Bit | Texture ID | Filtering Information | Texel Addr 0…7 | Texel Data 0…7 | Ready Bit 0…7>
Features
 ◦ The waiting list buffer (WLB) is similar to the inverted MSHR proposed in [Farkas and Jouppi 1994]
   - The WLB stores information for both missed and hit addresses
   - A texture address in the WLB plays a role similar to a register in the inverted MSHR
Data format of each WLB entry
 ◦ Valid bit: 0 = empty, 1 = occupied
 ◦ Texture ID: the ID number of a texture request
 ◦ Filtering information: the information needed to accomplish the texture mapping
 ◦ Texel addr N: the texture address of necessary texture data
 ◦ Texel data N: the texel data of texel addr N
 ◦ Ready bit N: 0 = invalid texel data N, 1 = valid texel data N

Block Address Buffer
<BAB: miss address → block address entries → block request address queue>
Feature
 ◦ The block address buffer issues DRAM accesses sequentially for the texel requests that caused cache misses
   - The block address buffer removes duplicate DRAM requests
   - When the data are loaded, all the requests whose duplicates were removed are found and serviced
   - The block address buffer is designed as a FIFO

Execution flow of the NBTC
<Proposed NBTC architecture (figure, as above)>
Start
 ◦ Execute lookup of the RB
 ◦ Generate texture addresses
 ◦ Execute tag comparison with the texel requests
   - All hits → hit handling case
   - A miss occurred → miss handling case

Hit handling case
 ◦ Read texel data from the L1 cache
 ◦ Input the texel data to the texture mapping unit via the MUX
 ◦ Execute texture mapping
 ◦ Update the RB

Miss handling case
 ◦ "Concurrent execution"
   - Read hit texel data from the L1 cache
   - Input missed texture requests to the WLB
   - Input missed texel requests to the BAB
   - Remove duplicate texel requests
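The hit and miss handling flows above can be sketched as a toy simulation. This is a simplified model under stated assumptions: single-texel requests, an instantaneous "DRAM", and a strict global in-order drain (the actual NBTC only requires IO completion among requests with the same screen coordinate); all names are illustrative.

```python
from collections import deque

class ToyNBTC:
    """Toy model of the NBTC flow: hits complete at once; misses are parked
    in a waiting list while a deduplicated block FIFO feeds memory; the
    retry buffer drains strictly in input order (IO completion)."""

    def __init__(self, cached_addrs):
        self.cache = set(cached_addrs)   # addresses currently in the L1 cache
        self.retry = deque()             # retry buffer, in input order
        self.waiting = {}                # waiting list: addr -> parked entries
        self.block_fifo = deque()        # block address buffer (dedup'd)

    def request(self, addr):
        entry = {"addr": addr, "ready": False}
        self.retry.append(entry)         # every request gets an RB slot
        if addr in self.cache:
            entry["ready"] = True        # hit handling: data available now
        else:
            self.waiting.setdefault(addr, []).append(entry)
            if addr not in self.block_fifo:   # BAB removes duplicate requests
                self.block_fifo.append(addr)

    def memory_return(self):
        """Complete one DRAM request; forward data to cache and wake waiters."""
        addr = self.block_fifo.popleft()
        self.cache.add(addr)
        for entry in self.waiting.pop(addr, []):
            entry["ready"] = True

    def drain(self):
        """Forward ready RB entries to the shading unit, strictly in order."""
        done = []
        while self.retry and self.retry[0]["ready"]:
            done.append(self.retry.popleft()["addr"])
        return done
```

With requests A (hit), B (miss), C (hit), only A drains at first; C waits behind B even though its data is ready, which is exactly the in-order completion the RB enforces. After the memory return, B and C drain together.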
 ◦ Process the next texture request

Miss handling case (continued)
 ◦ Complete the memory request
   - Forward the loaded data to the WLB and the cache
 ◦ Input the texel data to the texture mapping unit via the MUX
   - Determine the ready entry in the WLB and invalidate it
 ◦ Execute texture mapping
 ◦ Update the RB

Update RB
 ◦ Determine the ready entry in the RB
 ◦ Determine whether IO completion holds
 ◦ Forward the ready entry to the shading unit
 ◦ Process the next fragment information

Experimental Results

Simulator configuration
 ◦ mRPsim: announced by SAIT [Yoo et al. 2010]
   - An execution-driven, cycle-accurate simulator for an SRP-based GPU
   - The texture mapping unit was modified
   - Eight pixel processors
   - DRAM access latencies: 50, 100, 200, and 300 cycles
 ◦ Benchmark
   - Taiji, which has nearest, bi-linear, and tri-linear filtering modes
Cache configuration
 ◦ Four-way set associative, eight-word block size, 32 KByte cache size
 ◦ Number of entries in each buffer: 32

<Total PS cycles (M cycles) vs. DRAM access latency: stacked NBTC stall cycles, PS run cycles, and PS stall cycles>
Pixel shader cycles per frame
 ◦ PS run cycles: cycles spent running
 ◦ PS stall cycles: cycles spent stalled
 ◦ NBTC stall cycles: stall cycles due to the WLB being full
 ◦ The pixel shader's execution cycles decreased by 12.47% (latency 50) up to 41.64% (latency 300)

<Cache miss rate (%) of BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Cache miss rates
 ◦ The NBTC's cache miss rate increased slightly over the BTC's
   - The NBTC can handle subsequent cache accesses while a cache update is not yet complete

<Memory bandwidth (MBytes) of BTC vs. NBTC at DRAM access latencies of 50, 100, 200, and 300 cycles>
Memory bandwidth requirement
 ◦ The NBTC's memory bandwidth requirement increased by up to 11% over the BTC's
   - Because the block address buffer removes duplicate DRAM requests, the increase in the memory bandwidth requirement is comparatively small

Conclusion
A non-blocking texture cache that improves the performance of texture caches
 ◦ Basic OOO execution while maintaining IO completion for texture requests with the same screen coordinate
 ◦ Three buffers to support the non-blocking scheme:
   - The retry buffer: IO completion
   - The waiting list buffer: tracking miss information
   - The block address buffer: removing duplicate block addresses
We also plan to implement the proposed NBTC architecture in hardware and then measure both its power consumption and its hardware area

Thank you for your attention
http://rayman.sejong.ac.kr

Backup Slides