A Domain Specific On-Chip Network Design for Large Scale Cache Systems
Yuho Jin, Eun Jung Kim, Ki Hwan Yum
HPCA 2007

Motivation
• Large caches are becoming the norm
  – Not optimized; larger access times
• Network resources are overprovisioned and underutilized(!!)
  – Can the same performance be achieved with fewer resources?

Contributions
• Single-stage multicast router
  – Not really new! (Not really feasible??)
• New network topology for large caches (banks)
  – Minimizes the number of links in the system
• A new routing algorithm
  – For the new topology
• A new replacement policy for NUCA caches
  – Fast-LRU, which exploits multicast
• Overall aim: reduce network overhead for minimal performance loss!!

Router Architecture
• 4 VCs per PC
• Near one-stage router
  – Look-ahead routing
  – Buffer bypass
  – Speculative switch allocation
  – Arbitration precomputation

Co-ordinate System / Multicast Support
• Multicast: synchronous vs asynchronous
• Async multicast => flit replication => buffer space
• Can it be done without extra hardware? Use existing VC buffers:
  – Copy the flit to a different PC buffer (use less-used PCs)
  – Get a free VC
  – Send the flit to different destinations
• Figure courtesy: Chita R. Das, OCIN '06

Set Associative Cache
• Address fields: TAG | SET | OFF
• [Figure: a 4-way set-associative cache, ways 0–3 across, sets down]

Fast-LRU Replacement
• Bank-set arrangement: sets are distributed among banks by column
  – [Figure: e.g. S0 ways W1–W4 spread across one column of banks; S1, W0 in another]
• Address fields: TAG(12) | SET(10) | COL(4) | OFF(6)
• Multicast Fast-LRU

Network Topology: Access Patterns
• A: data request
• B: check for data
• B/C: move data
• D/E: data to core
• A'/B': hit or miss
• F: data delivery from memory to the MRU position
• G: dirty block to memory
• => Horizontal links are mostly not required (with one exception!!)

The HALO Network

Implications
• Underutilized links removed => simplified network, area savings(?)
  – Power savings come for free!!
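The Fast-LRU address breakdown above (TAG 12 bits, SET 10 bits, COL 4 bits, OFF 6 bits, which sum to 32) can be sketched as a field extraction, assuming a 32-bit address; the function name and layout here are illustrative, not from the paper:

```python
# Hypothetical sketch of the Fast-LRU address split, assuming a
# 32-bit address laid out as TAG(12) | SET(10) | COL(4) | OFF(6).
TAG_BITS, SET_BITS, COL_BITS, OFF_BITS = 12, 10, 4, 6

def split_address(addr):
    """Decompose a 32-bit address into (tag, set, column, offset)."""
    off = addr & ((1 << OFF_BITS) - 1)                       # block offset
    col = (addr >> OFF_BITS) & ((1 << COL_BITS) - 1)         # bank column
    set_idx = (addr >> (OFF_BITS + COL_BITS)) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFF_BITS + COL_BITS + SET_BITS)
    return tag, set_idx, col, off
```

The COL field selects which column of banks a set maps to, which is what lets sets be distributed among banks as described above.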
  (from fewer buffers)
• Links removed => constrained routing
  – XYX routing proposed (XY is dimension-order routing)
  – The algorithm is simple:
  – If going from a core/memory to a $ bank, travel horizontally first
  – If going from a $ bank to a core/memory, travel vertically first
• Example ($ bank to core/mem): Yoff = -ve, Xoff = +ve -> channel Y-
• Example (core/mem to $ bank): Yoff = +ve, Xoff = +ve -> channel X+, then Y+

Results
• Average access latency
  – They target network latency, which is 50–60% of total access latency
• IPC comparisons
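The XYX routing rule above can be sketched as a channel-selection function. This is a minimal illustration of the stated rule, not the paper's implementation; the function name, channel labels, and the `to_cache_bank` flag are assumptions:

```python
# Hypothetical sketch of XYX routing in the HALO topology:
# core/mem -> $ bank travels horizontally (X) first,
# $ bank -> core/mem travels vertically (Y) first.
def next_channel(xoff, yoff, to_cache_bank):
    """Pick the output channel from signed X/Y offsets to the destination."""
    if to_cache_bank:            # core/mem -> $ bank: X first, then Y
        if xoff != 0:
            return 'X+' if xoff > 0 else 'X-'
        if yoff != 0:
            return 'Y+' if yoff > 0 else 'Y-'
    else:                        # $ bank -> core/mem: Y first, then X
        if yoff != 0:
            return 'Y+' if yoff > 0 else 'Y-'
        if xoff != 0:
            return 'X+' if xoff > 0 else 'X-'
    return 'LOCAL'               # arrived at the destination router
```

This reproduces the two slide examples: a bank-to-core packet with Yoff negative takes Y- even though Xoff is positive, while a core-to-bank packet with both offsets positive takes X+ before Y+.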