
Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor

José-María Arnau (UPC)

Joan-Manuel Parcerisa (UPC)

Polychronis Xekalakis (Intel)

Focusing on Mobile GPUs

Market demands + technology limitations → energy-efficient mobile GPUs

Image sources:

1 http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html (Samsung Galaxy SII vs. Samsung Galaxy Note running the game Shadow Gun 3D)

2 http://www.ispsd.com/02/battery-psd-templates/


GPU Performance and Memory

A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games

Graphical workloads:

 Large working sets not amenable to caching

 Texture memory accesses are fine-grained and unpredictable

Traditional techniques to deal with memory:

 Caches

 Prefetching

 Multithreading


Outline

Background

Methodology

Multithreading & Prefetching

Decoupled Access/Execute

Conclusions


Assumed GPU Architecture


Assumed Fragment Processor

Warp: a group of threads executed in lockstep (SIMD group)

 4 threads per warp

 4-wide vector registers (16 bytes)

 36 registers per thread
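These figures pin down the per-warp register state; a minimal sketch of the arithmetic, assuming the state is simply threads × registers × register width with no extra per-warp metadata:

```python
# Sketch: per-warp register state implied by the figures above.
# Assumption: state = threads/warp * registers/thread * bytes/register.

THREADS_PER_WARP = 4
REGISTERS_PER_THREAD = 36
BYTES_PER_REGISTER = 16  # 4-wide vector register, 4 bytes per component

bytes_per_warp = THREADS_PER_WARP * REGISTERS_PER_THREAD * BYTES_PER_REGISTER
print(bytes_per_warp)  # 2304 bytes, matching the methodology table
```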


Methodology

Main memory: latency = 100 cycles, bandwidth = 4 bytes/cycle

Pixel/Texture caches: 2 KB, 2-way, 2 cycles

L2 cache: 32 KB, 8-way, 12 cycles

Number of cores: 4 vertex, 4 pixel processors

Warp width: 4 threads

Register file size: 2304 bytes per warp

Number of warps: 1-16 warps/core

Power model: CACTI 6.5 and Qsilver
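For reference, the same simulated configuration gathered into one minimal sketch; the key names and structure below are mine, not the simulator's actual configuration format:

```python
# Hypothetical sketch of the simulated GPU configuration listed above.
# Key names and structure are illustrative, not the simulator's real format.

gpu_config = {
    "main_memory": {"latency_cycles": 100, "bandwidth_bytes_per_cycle": 4},
    "pixel_texture_caches": {"size_kb": 2, "ways": 2, "latency_cycles": 2},
    "l2_cache": {"size_kb": 32, "ways": 8, "latency_cycles": 12},
    "cores": {"vertex_processors": 4, "pixel_processors": 4},
    "warp_width_threads": 4,
    "register_file_bytes_per_warp": 2304,
    "warps_per_core_range": (1, 16),        # the study sweeps this parameter
    "power_model": ["CACTI 6.5", "Qsilver"],
}
```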


Workload Selection

2D games

 Small/medium-sized textures

 Texture filtering: 1 memory access

 Small fragment programs

Simple 3D games

 Small/medium-sized textures

 Texture filtering: 1-4 memory accesses

 Small/medium fragment programs

Complex 3D games

 Medium/large-sized textures

 Texture filtering: 4-8 memory accesses

 Large, memory-intensive fragment programs
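The access counts above roughly track the texture filtering mode used by each class of game; a minimal illustrative mapping, assuming nearest (1 texel), bilinear (2x2 texels), and trilinear (2x2 texels on two mipmap levels) filtering:

```python
# Illustrative mapping from texture filter mode to texel fetches per sample.
# Assumption: nearest = 1 texel, bilinear = 2x2, trilinear = 2x2 on two mip levels.

TEXELS_PER_SAMPLE = {"nearest": 1, "bilinear": 4, "trilinear": 8}

def texel_fetches(filter_mode: str, fragments: int) -> int:
    """Texel fetches needed to shade `fragments` fragments, one texture lookup each."""
    return TEXELS_PER_SAMPLE[filter_mode] * fragments

print(texel_fetches("trilinear", 1000))  # 8000 fetches for a complex 3D scene
```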


Improving Performance Using Multithreading

Very effective

High energy cost (25% more energy)

Requires a huge register file to maintain the state of all threads

 36 KB main register file (MRF) for a GPU with 16 warps/core (larger than the L2 cache)
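A minimal sketch of where the 36 KB figure comes from, assuming the MRF is just the 2304-byte per-warp register state replicated for every warp on the core:

```python
# Sketch: main register file (MRF) size per core as the warp count scales.
# Assumption: MRF = warps/core * per-warp register state, with no overhead.

BYTES_PER_WARP = 2304  # 4 threads * 36 registers * 16 bytes

for warps_per_core in (1, 2, 4, 8, 16):
    mrf_kb = warps_per_core * BYTES_PER_WARP / 1024
    print(f"{warps_per_core:2d} warps/core -> {mrf_kb:5.2f} KB MRF")
# 16 warps/core -> 36.00 KB, larger than the 32 KB L2 cache
```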


Employing Prefetching

Hardware prefetchers:

Global History Buffer

 K. J. Nesbit and J. E. Smith. “Data Cache Prefetching Using a Global History Buffer”. HPCA, 2004.

Many-Thread Aware

 J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc. “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications”. MICRO, 2010.

Prefetching is effective but there is still ample room for improvement
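For context, a minimal sketch of the Global History Buffer idea cited above, reduced to a PC-localized stride detector; a real GHB links same-PC entries through an index table rather than scanning the buffer, so treat this purely as an illustration:

```python
from collections import deque

class GlobalHistoryBufferPrefetcher:
    """Simplified PC-localized GHB stride prefetcher (illustration only)."""

    def __init__(self, ghb_size: int = 256, degree: int = 2):
        self.ghb = deque(maxlen=ghb_size)   # FIFO of (pc, miss_address) entries
        self.degree = degree                # prefetches issued per detected stride

    def on_cache_miss(self, pc: int, addr: int) -> list:
        """Record the miss and return the addresses to prefetch, if any."""
        self.ghb.append((pc, addr))
        # Collect the most recent misses issued by this PC.
        history = [a for (p, a) in self.ghb if p == pc][-3:]
        if len(history) < 3:
            return []
        # Prefetch ahead only if the last three misses form a constant stride.
        stride = history[-1] - history[-2]
        if stride != 0 and stride == history[-2] - history[-3]:
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []

# Usage sketch: a strided texture-access miss stream triggers prefetches.
pf = GlobalHistoryBufferPrefetcher()
for miss in (0x1000, 0x1040, 0x1080, 0x10C0):
    print(hex(miss), [hex(a) for a in pf.on_cache_miss(pc=0x400, addr=miss)])
```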


Decoupled Access/Execute

Use the fragment information to compute the addresses that will be requested when processing the fragment

Issue memory requests while the fragments are waiting in the tile queue (see the sketch below)

Tile queue size:

Too small: timeliness is not achieved

Too big: cache conflicts
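A minimal sketch of the decoupled scheme described above: addresses are derived from fragment information at tile-queue insertion and prefetched, so the data is ideally already in the texture cache by the time the fragment is shaded. The texture layout, helper names, and print-based hooks are all illustrative assumptions, not the paper's exact hardware:

```python
from collections import deque

# Illustrative parameters; not taken from the paper.
TEXTURE_BASE = 0x100000   # hypothetical texture base address
TEXTURE_WIDTH = 512       # texels per row
BYTES_PER_TEXEL = 4

def texture_addresses(fragment: dict) -> list:
    """Access side: derive the texel addresses a fragment will need from its
    texture coordinates (here, a 2x2 bilinear footprint around (u, v))."""
    u, v = fragment["u"], fragment["v"]
    return [TEXTURE_BASE + ((v + dv) * TEXTURE_WIDTH + (u + du)) * BYTES_PER_TEXEL
            for dv in (0, 1) for du in (0, 1)]

def prefetch(addr: int) -> None:
    """Hypothetical hook: issue a non-blocking request toward the texture cache."""
    print(f"prefetch 0x{addr:x}")

def shade(fragment: dict) -> None:
    """Execute side: run the fragment program; its texels should already be cached."""
    print(f"shade fragment at ({fragment['u']}, {fragment['v']})")

# Tile queue: too small and prefetches are not timely; too large and the
# prefetched lines start conflicting in the small texture cache.
tile_queue = deque()

def enqueue_fragment(fragment: dict) -> None:
    # Issue the fragment's memory requests as soon as it enters the tile queue,
    # long before the shader actually processes it.
    for addr in texture_addresses(fragment):
        prefetch(addr)
    tile_queue.append(fragment)

def process_next_fragment() -> None:
    if tile_queue:
        shade(tile_queue.popleft())

enqueue_fragment({"u": 10, "v": 20})
process_next_fragment()
```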


Inter-Core Data Sharing

66.3% of cache misses are requests to data available in the L1 cache of another fragment processor

Use the prefetch queue to detect inter-core data sharing (see the sketch below)

Saves bandwidth to the L2 cache

Saves power (L1 caches smaller than L2)

Associative comparisons require additional energy
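A minimal sketch of servicing an L1 miss from a remote L1; the paper detects sharing through the prefetch queue, which this sketch abstracts as a per-core directory of resident lines, so the structures and names here are assumptions:

```python
# Sketch of servicing an L1 miss: before going to the L2, check whether another
# fragment processor's L1 already holds the line. The per-core line directory and
# the remote-read path are modeling assumptions, not the paper's exact hardware.

LINE_SIZE = 64  # bytes per cache line (illustrative)

class FragmentProcessor:
    def __init__(self, core_id: int):
        self.core_id = core_id
        self.l1_lines = set()            # tags of lines resident in this core's L1

    def holds(self, addr: int) -> bool:
        return addr // LINE_SIZE in self.l1_lines

def service_l1_miss(addr, requester, cores):
    """Return where the miss was serviced from: a remote L1 or the L2 cache."""
    for core in cores:
        if core is not requester and core.holds(addr):
            # Remote L1 hit: saves L2 bandwidth and energy (L1s are smaller than the
            # L2), at the cost of the associative comparisons needed to find the line.
            return f"remote L1 of core {core.core_id}"
    return "L2 cache"

# Usage sketch: core 1 already cached the texel block that core 0 now misses on.
cores = [FragmentProcessor(0), FragmentProcessor(1)]
cores[1].l1_lines.add(0x2000 // LINE_SIZE)
print(service_l1_miss(0x2000, requester=cores[0], cores=cores))  # remote L1 of core 1
```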


Decoupled Access/Execute

 33% faster than hardware prefetchers, 9% energy savings

 DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, providing 34% energy savings


Benefits of Remote L1 Cache Accesses

 Single threaded GPU

 Baseline: Global History Buffer

 30% speedup

 5.4% energy savings


Conclusions

High-performance, energy-efficient GPUs can be architected based on the decoupled access/execute concept

A combination of decoupled access/execute (to hide memory latency) and multithreading (to hide functional-unit latency) provides the most energy-efficient solution

Allowing for remote L1 cache accesses provides L2 cache bandwidth savings and energy savings

The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings


Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor

Thank you!

Questions?

José-María Arnau (UPC)

Joan-Manuel Parcerisa (UPC)

Polychronis Xekalakis (Intel)
