Boosting Mobile GPU Performance with a
Decoupled Access/Execute Fragment Processor
José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)
Focusing on Mobile GPUs
Market demands¹ and technology limitations² call for energy-efficient mobile GPUs
¹ Samsung Galaxy S II vs. Samsung Galaxy Note when running the game Shadow Gun 3D: http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html
² http://www.ispsd.com/02/battery-psd-templates/
GPU Performance and Memory
A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games
Graphical workloads:
Large working sets not amenable to caching
Texture memory accesses are fine-grained and unpredictable
Traditional techniques to deal with the memory bottleneck:
Caches
Prefetching
Multithreading
Outline
Background
Methodology
Multithreading & Prefetching
Decoupled Access/Execute
Conclusions
Assumed GPU Architecture
Assumed Fragment Processor
Warp: a group of threads executed in lockstep (SIMD group)
4 threads per warp
4-wide vector registers (16 bytes)
36 registers per thread
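From these parameters the register file storage follows directly. A minimal C++ sketch of the arithmetic (constant names are illustrative, and the 16 warps/core figure anticipates the largest configuration evaluated later):

// Register file sizing from the figures above (constant names are illustrative).
constexpr int kRegsPerThread   = 36;   // registers per thread
constexpr int kBytesPerReg     = 16;   // 4-wide vector registers, 4 bytes per component
constexpr int kThreadsPerWarp  = 4;    // SIMD width
constexpr int kBytesPerWarp    = kRegsPerThread * kBytesPerReg * kThreadsPerWarp;  // 2304 bytes per warp
constexpr int kWarpsPerCore    = 16;   // largest configuration considered later
constexpr int kMRFBytesPerCore = kBytesPerWarp * kWarpsPerCore;                    // 36864 bytes ≈ 36 KB

These two products are exactly the 2304 bytes/warp in the methodology table and the 36 KB MRF mentioned on the multithreading slide.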
Methodology
Main memory: 100-cycle latency, 4 bytes/cycle bandwidth
Pixel/Texture caches: 2 KB, 2-way, 2-cycle latency
L2 cache: 32 KB, 8-way, 12-cycle latency
Number of cores: 4 vertex processors, 4 pixel processors
Warp width: 4 threads
Register file size: 2304 bytes per warp
Number of warps: 1-16 warps/core
Power model: CACTI 6.5 and Qsilver
Workload Selection
2D games:
Small/medium-sized textures
Texture filtering: 1 memory access
Small fragment programs
Simple 3D games:
Small/medium-sized textures
Texture filtering: 1-4 memory accesses
Small/medium fragment programs
Complex 3D games:
Medium/large textures
Texture filtering: 4-8 memory accesses
Large, memory-intensive fragment programs
Improving Performance Using Multithreading
Very effective at tolerating memory latency
High energy cost (25% more energy)
Requires a huge register file to keep the state of all in-flight threads
36 KB main register file (MRF) per core with 16 warps/core (larger than the 32 KB L2 cache)
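To illustrate why multithreading is so effective, here is a minimal sketch of a round-robin warp scheduler that switches to another ready warp whenever one stalls on a miss; the fixed miss latency, miss rate and names are illustrative assumptions, not the simulated GPU:

#include <cstdint>
#include <vector>

// Minimal sketch of why multithreading works: when a warp stalls on a miss, a
// round-robin scheduler simply issues from another ready warp.
struct Warp {
  int64_t readyAt = 0;  // cycle at which this warp can issue again
};

// Simulate `cycles` cycles; returns the number of instructions issued.
int64_t simulate(int numWarps, int64_t cycles, int64_t missLatency, double missRate) {
  std::vector<Warp> warps(numWarps);
  int64_t issued = 0;
  uint64_t lcg = 1;  // tiny deterministic generator, good enough for a sketch
  int next = 0;      // round-robin starting point
  for (int64_t cycle = 0; cycle < cycles; ++cycle) {
    for (int i = 0; i < numWarps; ++i) {
      Warp& w = warps[(next + i) % numWarps];
      if (w.readyAt > cycle) continue;            // this warp is stalled on memory
      ++issued;
      lcg = lcg * 6364136223846793005ULL + 1ULL;
      if ((lcg >> 33) % 1000 < uint64_t(missRate * 1000))
        w.readyAt = cycle + missLatency;          // warp stalls; the others keep the core busy
      next = (next + i + 1) % numWarps;
      break;
    }
  }
  return issued;
}

With the 100-cycle memory latency from the methodology slide, simulate(16, 100000, 100, 0.1) keeps the core issuing almost every cycle, while simulate(1, 100000, 100, 0.1) stalls most of the time; the price is one 2304-byte register slice per extra warp.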
Employing Prefetching
Hardware prefetchers:
Global History Buffer: K. J. Nesbit and J. E. Smith, “Data Cache Prefetching Using a Global History Buffer”, HPCA 2004.
Many-Thread Aware: J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc, “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications”, MICRO 2010.
Prefetching is effective, but there is still ample room for improvement
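As a point of reference for the first of these baselines, here is a minimal sketch of a Global History Buffer prefetcher with a simplified stride/delta rule; the buffer size, prefetch degree, PC-based indexing and the delta test are illustrative assumptions, not the configuration evaluated in this work:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal Global History Buffer prefetcher sketch (after Nesbit & Smith, HPCA 2004).
class GHBPrefetcher {
  static constexpr int kGHBSize = 256;  // circular buffer of recent miss addresses
  static constexpr int kDegree  = 4;    // prefetches issued per triggering miss

  struct Entry { uint64_t addr = 0; uint64_t prevSeq = 0; uint64_t seq = 0; };
  std::vector<Entry> ghb_ = std::vector<Entry>(kGHBSize);
  std::unordered_map<uint64_t, uint64_t> index_;  // index table: PC -> seq of its newest entry
  uint64_t seq_ = 1;                              // global insertion counter (0 means "none")

  // Fetch the entry with the given sequence number, or nullptr if its slot was recycled.
  const Entry* lookup(uint64_t seq) const {
    if (seq == 0) return nullptr;
    const Entry& e = ghb_[seq % kGHBSize];
    return e.seq == seq ? &e : nullptr;           // recycled slots carry a newer seq
  }

 public:
  // Called on every cache miss; returns the addresses to prefetch.
  std::vector<uint64_t> onMiss(uint64_t pc, uint64_t addr) {
    // Insert the miss into the FIFO and link it to the previous miss from the same PC.
    uint64_t prevSeq = 0;
    if (auto it = index_.find(pc); it != index_.end()) prevSeq = it->second;
    const uint64_t mySeq = seq_++;
    ghb_[mySeq % kGHBSize] = Entry{addr, prevSeq, mySeq};
    index_[pc] = mySeq;

    // Walk the per-PC linked list to recover the recent miss-address history.
    std::vector<uint64_t> hist{addr};
    for (const Entry* e = lookup(prevSeq); e != nullptr && hist.size() < 8;
         e = lookup(e->prevSeq))
      hist.push_back(e->addr);

    // Simplified delta rule: if the last two strides match, keep extrapolating.
    std::vector<uint64_t> prefetches;
    if (hist.size() >= 3) {
      const int64_t d1 = int64_t(hist[0] - hist[1]);
      const int64_t d2 = int64_t(hist[1] - hist[2]);
      if (d1 != 0 && d1 == d2) {
        uint64_t next = addr;
        for (int i = 0; i < kDegree; ++i) prefetches.push_back(next += d1);
      }
    }
    return prefetches;
  }
};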
Decoupled Access/Execute
Use the fragment information to compute the addresses that will be requested when processing the fragment
Issue memory requests while the fragments are waiting in the tile queue
Tile queue size trade-off:
Too small: prefetches are not timely
Too large: prefetched data causes cache conflicts
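A minimal sketch of this decoupled access stage, assuming a point-sampled 2D texture with a simple linear layout; the structure names, the prefetch callback and the queue-size back-pressure policy are illustrative assumptions, not the exact hardware design:

#include <cstdint>
#include <deque>
#include <vector>

// The address calculation for real mipmapped, tiled textures is more involved.
struct Fragment {
  float u, v;          // interpolated texture coordinates in [0, 1]
  uint32_t textureId;  // texture bound to this draw call
};

struct Texture {
  uint64_t base;       // base address of the texture in memory
  uint32_t width, height, texelBytes;
};

// Addresses this fragment will request when it is eventually executed.
std::vector<uint64_t> accessAddresses(const Fragment& f, const Texture& t) {
  const uint32_t x = uint32_t(f.u * (t.width  - 1));
  const uint32_t y = uint32_t(f.v * (t.height - 1));
  return { t.base + (uint64_t(y) * t.width + x) * t.texelBytes };
}

// Fragments enter the tile queue after rasterization; their addresses are prefetched
// immediately, and the execute stage pops them later, by which time the data should
// already be sitting in the L1 texture cache.
class DecoupledAccess {
  std::deque<Fragment> tileQueue_;
  size_t maxQueue_;  // too small: prefetches are not timely; too large: cache conflicts

 public:
  explicit DecoupledAccess(size_t maxQueue) : maxQueue_(maxQueue) {}

  // Access stage: runs ahead of execution.
  bool push(const Fragment& f, const Texture& t, void (*prefetch)(uint64_t addr)) {
    if (tileQueue_.size() >= maxQueue_) return false;      // back-pressure to the rasterizer
    for (uint64_t a : accessAddresses(f, t)) prefetch(a);   // issue the memory requests early
    tileQueue_.push_back(f);
    return true;
  }

  // Execute stage: consumes fragments whose data has (hopefully) already arrived.
  bool pop(Fragment& f) {
    if (tileQueue_.empty()) return false;
    f = tileQueue_.front();
    tileQueue_.pop_front();
    return true;
  }
};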
Inter-Core Data Sharing
66.3% of cache misses are requests to data available in the L1 cache of another fragment processor
Use the prefetch queue to detect inter-core data sharing
Saves bandwidth to the L2 cache
Saves energy (the L1 caches are smaller than the L2)
Associative comparisons require additional energy
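A minimal sketch of this remote-L1 lookup, assuming each core exposes the cache-line addresses held in its prefetch queue; the interface and names are illustrative assumptions, and the real hardware performs these comparisons associatively:

#include <cstdint>
#include <optional>
#include <vector>

struct Core {
  int id;
  std::vector<uint64_t> prefetchQueue;  // line addresses this core has recently requested

  bool holdsLine(uint64_t lineAddr) const {
    for (uint64_t a : prefetchQueue)    // associative comparison: this is where the extra energy goes
      if (a == lineAddr) return true;
    return false;
  }
};

// On a miss in core `self`, prefer a sibling L1 that already has the line over the L2.
// Returns the id of the core to read from, or nothing if the request must go to the L2.
std::optional<int> findRemoteL1(const std::vector<Core>& cores, int self, uint64_t lineAddr) {
  for (const Core& c : cores)
    if (c.id != self && c.holdsLine(lineAddr))
      return c.id;       // serve the miss from a remote L1: saves L2 bandwidth and energy
  return std::nullopt;   // no sharer found
}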
Decoupled Access/Execute: Results
33% faster than hardware prefetchers, with 9% energy savings
DAE with 2 warps/core achieves 93% of the performance of a larger GPU with 16 warps/core, while providing 34% energy savings
Benefits of Remote L1 Cache Accesses
Single-threaded GPU
Baseline: Global History Buffer
30% speedup
5.4% energy savings
Conclusions
High-performance, energy-efficient mobile GPUs can be architected around the decoupled access/execute concept
Combining decoupled access/execute (to hide memory latency) with multithreading (to hide functional-unit latency) provides the most energy-efficient solution
Allowing remote L1 cache accesses saves L2 cache bandwidth and energy
The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup and 9% energy savings
Boosting Mobile GPU Performance with a
Decoupled Access/Execute Fragment Processor
Thank you!
Questions?
José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)