- Structure of a Graphics Adapter
- Color Representation
- Video Memory
- Graphics Accelerators
- 3D Accelerators
- Graphics Processing Units
- Digital Interfaces for Monitors

12/18/2014 Input/Output Systems and Peripheral Devices (06-2)

Graphics Processing Units
- Overview
- GPGPU Computing
- The CUDA Architecture
- The Kepler GK110 Architecture

GPU – Graphics Processing Unit
- Dedicated graphics processors for PCs, workstations, and game consoles
- Initially used to accelerate the rendering stage for 3D graphics (e.g., texture mapping)
- Later also used to accelerate the geometric computations (rotation, translation)
- GPUs contain shader units and modules for texture mapping, anti-aliasing, etc.

- Vertex shader units
  - Transform the 3D position of each vertex to the 2D coordinates on the screen and to the depth value for the z-buffer
  - Modify the attributes of vertices: position, color, texture coordinates
- Geometry shader units
  - Generate geometric figures or add volumetric details to objects

- Pixel/fragment shader units
  - Determine the color, z depth, and alpha value for each pixel or fragment
- Unified shader units
  - Programmable units
  - Able to perform various shading operations (vertex, geometry, pixel)
- GPUs contain an array of computing units and a unit that distributes the operations to be performed

- The architecture with programmable units allows a more flexible use of the hardware resources
  - The programmable units can also be used for other types of computations
  - A flexible parallel architecture is obtained
- GPUs also include modules for 2D acceleration, MPEG compression, high-definition video decoding

- GPUs can be dedicated or integrated
- Dedicated GPUs
  - Used in graphics cards interfaced with
    the motherboard via a PCI Express bus or AGP (Accelerated Graphics Port) interface
  - Have dedicated memory for the card's use
  - Examples
    - AMD Radeon HD 8xxx (e.g., 8970)
    - NVIDIA GeForce GTX 7xx (e.g., 780)

- Integrated GPUs
  - Are integrated into a chipset or processor
  - Use a portion of the system memory
  - Have lower performance compared to dedicated GPUs
  - Examples
    - Intel HD Graphics (e.g., HD Graphics 4600)
    - AMD Radeon HD 8xxx in APU (Accelerated Processing Unit) processors
    - NVIDIA in Tegra 4 and Tegra 4i processors

- The design of GPUs was influenced by the 2D and 3D programming interfaces
  - Implement API functions in hardware
- OpenGL (Open Graphics Library)
  - For various platforms and languages
  - Functions to draw 3D scenes from primitives
- Direct3D (component of DirectX)
  - Only for the Microsoft operating systems
  - Low-level interface to the 3D hardware functions

- Technologies for connecting multiple GPUs on different graphics cards
- NVIDIA: SLI (Scalable Link Interface)
  - 2 ..
    4 identical graphics cards are connected via a motherboard (PCIe x16)
- AMD: CrossFireX
  - Up to 4 graphics cards can be connected
  - The graphics cards do not have to be identical
  - The cards have external connectors

GPGPU (General Purpose computing on GPU)
- The GPU processing cores provide massive FP computational power
  - Example: a single NVIDIA Tesla K40 GPU (2,880 cores) achieves 4.29 TFLOPS (peak, single precision)
- The graphics pipeline can also be used for general-purpose applications
- The performance can be orders of magnitude higher than that of conventional CPUs

- GPUs can process independent vertices and pixels/fragments, so they act as stream processors
  - Stream: set of records that require similar computation
  - Kernel function: applied to each element in the stream
  - Shared memories cannot be used
- Ideal GPGPU applications: large data sets, high parallelism, reduced dependencies

- Disadvantages of GPGPU computing:
  - The programmer needs to be familiar with the graphics APIs and the GPU architecture
  - Problems need to be expressed in terms of coordinates, textures, shader functions
  - The need to use graphics programming languages: OpenGL, DirectX, Cg
- API extensions for running some program functions on the GPU's processors: CUDA (NVIDIA), OpenCL (Khronos Group)

CUDA (Compute Unified Device Architecture)
- Software and hardware architecture
- Enables GPUs to execute programs written in C, C++, Fortran, OpenCL languages
- Allows the use of Microsoft's
  DirectCompute API
- Allows direct access to the GPU resources for general-purpose computing
- Exploits the GPU's capability to operate on large matrices in parallel

- A CUDA program calls kernel functions executed by threads
  - Threads are organized into blocks and groups of blocks (grids)
- Thread block:
  - Set of concurrent threads
  - Communicate via a shared memory
  - Each thread has an identifier, registers, private memory, inputs, outputs

- Grid of blocks:
  - Group (array) of thread blocks
  - The blocks execute the same kernel function
  - Ensure synchronization between dependent kernel functions
  - Results are shared in a global memory allocated to an application, after global synchronization

- The hierarchy of threads is executed on a hierarchy of processors on the GPU
  - Threads: executed by CUDA cores and other execution units
  - Thread blocks: executed by a streaming multiprocessor (SM, SMX)
    - Group of 32 threads: warp
  - Grids of blocks: executed by the GPU

The Kepler GK110 Architecture (1)
- Improvement of NVIDIA's previous Fermi GPU architecture
- Contains 7.1 billion transistors
- Increased number of CUDA cores and double-precision arithmetic units
- Improved power efficiency: up to 3x the performance/watt of the Fermi architecture
- New architecture for the streaming multiprocessor (SMX)
- Enhanced memory subsystem

The Kepler GK110 Architecture (2)
- A full implementation contains:
  - 15 SMX units with 192 CUDA cores each (15 x 192 = 2,880 cores)
  - Six 64-bit memory
    controllers (384-bit interface)
  - Common L2 cache memory for the SMX units
  - PCI Express 3.0 interface to the CPU
  - Global scheduler: GigaThread Engine

The Kepler GK110 Architecture (3)
[Figure]

The Kepler GK110 Architecture (4)
- Each CUDA core contains:
  - Integer arithmetic and logic unit
  - Floating-point unit
    - IEEE 754-2008
    - Fused multiply-add (FMA) instruction: more accurate
- Each SMX unit also contains:
  - 64 double-precision (DP) units
  - 32 load/store units (LD/ST)
  - 32 special-function units (SFU) for transcendental functions

The Kepler GK110 Architecture (5)
[Figure]

The Kepler GK110 Architecture (6)
- Threads are scheduled in groups of 32 (warps)
- Each SMX unit contains:
  - Four warp schedulers
  - Eight instruction dispatch units
- From each warp, two instructions can be dispatched in each cycle
- Each thread can access up to 255 registers

The Kepler GK110 Architecture (7)
- Memory subsystem
- Each SMX unit contains:
  - 64 KB of memory that can be used as shared memory or L1 cache
    - Configurations (shared/L1): 48 KB/16 KB; 16 KB/48 KB; 32 KB/32 KB
  - 48 KB of read-only data cache memory
- 1536 KB of L2 cache memory: allows data to be shared between the SMX units
- The register files and memories are protected by an error-correcting code (ECC)

Digital Interfaces for Monitors
- DVI
- HDMI
- DisplayPort

DVI – Digital Visual Interface
- Developed by DDWG (Digital Display Working Group)
- Intended for liquid crystal monitors and
  digital projectors
- Based on the PanelLink technology of Silicon Image: a serial interface for uncompressed digital video data
- Partially compatible with HDMI (digital mode) and VGA (analog mode) interfaces

- Contains signals for a DDC (Display Data Channel) between the monitor and computer
  - Implemented with the ACCESS.bus serial bus (based on I2C)
- DDC2 provides bidirectional communication between the monitor and computer
  - Allows for automatic system configuration
- The format of the configuration data is defined by the EDID (Extended Display Identification Data) standard
  - The data is held in an EDID EPROM

The TMDS Protocol
- Transition Minimized Differential Signaling
- Developed by Silicon Image
- Differential signaling is used
- Minimizes the number of signal transitions from 1 to 0 and vice versa
  - 8b/10b encoding
- A TMDS link consists of a TMDS transmitter and a TMDS receiver

[Figure]

- The TMDS transmitter contains three identical encoders
- The inputs of each encoder are 8 bits of pixel data and 2 control bits
- In each clock cycle, the encoder generates a 10-bit character:
  - From the 8 data bits, or
  - From the 2 control bits
- The output of each encoder is a continuous stream of serialized TMDS characters

- Maximum pixel clock frequency: 165 MHz
- The binary data rate of a TMDS channel: 10 x pixel clock frequency
  - For a TMDS link: 3 x 1.65 = 4.95 Gbits/s
- Maximum pixel rate: 165 megapixels/s
  - 2.75 megapixels/frame at 60 Hz
  - Maximum resolution: 1920x1440 (4:3) or 2048x1152 (16:9) at 60 Hz
- Increasing the resolution: dual TMDS link
  - The connector contains pins for two links

Maximum resolutions supported by DVI

Refresh Rate   Single TMDS Link    Dual TMDS Link
60 Hz          1920x1080 (HDTV)    2048x1536 (QXGA)
75 Hz          1280x1024 (SXGA)    2048x1536 (QXGA)
85 Hz          1280x1024 (SXGA)    1920x1080 (HDTV)

Types of connectors
- DVI-I (DVI-Integrated): contains the digital signals for a single or dual link and the analog signals (a)
- DVI-D (DVI-Digital-only): contains only the digital signals (b)

HDMI – High-Definition Multimedia Interface
- Audio/video interface for uncompressed digital data
- For connecting A/V sources to computer monitors, digital TVs, digital audio devices
- Enables sending over a single cable:
  - Various TV and PC video formats
  - Up to 8 digital audio data streams
  - Auxiliary data and control information

- Uses the TMDS protocol
  - HDMI signals are electrically compatible with the DVI signals, so a passive adapter is sufficient
- Video period: for the pixels of an active video line (8b/10b); includes horizontal and vertical blanking intervals
- Data period: for audio and auxiliary data packets (4b/10b)
  - Audio mute, color depth, color space
- Control period: between video and data periods

- Version 1.0 (2002)
  - Maximum bandwidth of 4.95 Gbits/s (165 MHz), allowing a resolution of 1920x1200 (WUXGA) at 60 Hz
- Version 1.1 (2004)
  - Supports the DVD Audio format
- Version 1.2 (2005)
  - Supports the SACD (Super Audio CD) format
  - Allows PC applications to support only the RGB color space

- Version 1.3 (2006)
  - Bandwidth of 10.2 Gbits/s (340 MHz), allowing a resolution of 2560x1600 (WQXGA) at 60 Hz
  - Supports video images with more colors: 30, 36, or 48 bits/pixel (Deep Color, optional)
  - Supports the Dolby TrueHD and DTS-HD Master Audio formats (optional)
  - Two types of cables:
    - Category 1: up to 74.25 MHz
      (720p or 1080i)
    - Category 2: up to 340 MHz (1080p or more)
  - A smaller connector: type C

- Version 1.4 (2009)
  - The same bandwidth
  - Resolutions of 4K2K: 3840x2160p (Quad HD) at 24, 25, or 30 Hz; 4096x2160p at 24 Hz
  - HDMI Ethernet channel (100 Mbits/s)
  - Audio return channel (ARC)
  - Stereoscopic 3D formats
  - Micro HDMI connector (type D)
  - Automotive connection system

- Version 1.4a (2010)
  - Specifies two new mandatory 3D formats
- Version 1.4b (2011)
  - Support for a resolution of 1920x1080p at 120 Hz
- The HDMI Forum (www.hdmiforum.org) was created in 2011
- Version 2.0 (2013)
  - Bandwidth increased to 18 Gbits/s
  - 4K2K resolutions at 60 Hz

- HDMI connections
  - Single-link: pixel rate of 25 MHz .. 340 MHz
  - Dual-link: pixel rate of 25 MHz .. 680 MHz
- Audio formats
  - Uncompressed audio: PCM (Pulse Code Modulation)
    - Sampling rates: 32; 44.1; 48; 96; 192 kHz
    - Sample sizes: 16, 20, or 24 bits
  - Compressed audio: Dolby Digital, DTS
  - Lossless compressed audio: Dolby TrueHD, DTS-HD Master Audio

- Video formats
  - Color spaces: RGB, YCbCr, xvYCC (optional)
    - YCbCr: Y luminance and synchronization; Cb and Cr chroma (Cb = B - Y, Cr = R - Y)
    - xvYCC: chroma values may correspond to negative RGB values, giving more saturated colors
  - Deep Color option: 10 bits, 12 bits, or 16 bits per color component
    - 12 bits per color component: 68.7 billion colors

CEC (Consumer Electronics Control)
- One-wire bidirectional serial bus used to transfer remote control commands
  - One Touch Play, System Standby, Tuner Control
- The user can control several devices connected through HDMI with a single remote control
- Devices can command each other without user intervention
- Alternative names: Anynet+ (Samsung), BRAVIA Link (Sony), EasyLink (Philips)
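
As a quick check of the Deep Color figure quoted under the HDMI video formats above, the number of representable colors is 2 raised to the total number of bits per pixel (three color components). The Python sketch below is illustrative only: the function name `color_count` is made up for this example and is not part of any HDMI API.

```python
def color_count(bits_per_component: int, components: int = 3) -> int:
    """Number of distinct colors for a pixel with the given component depth."""
    return 2 ** (bits_per_component * components)


# Standard 8-bit color, followed by the three Deep Color depths listed above.
for bits in (8, 10, 12, 16):
    total_bits = bits * 3
    print(f"{bits} bits/component = {total_bits} bits/pixel: "
          f"{color_count(bits):,} colors")
```

For 12 bits per component this gives 2^36 = 68,719,476,736, which is the 68.7 billion colors quoted above. The 10, 12, and 16 bit depths correspond to the 30, 36, and 48 bits/pixel listed for HDMI version 1.3.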

Connectors
- Type A: 19 pins, single-link connection
- Type B: 29 pins, dual-link connection
- Type C: mini-connector, 19 pins; can be connected to a Type A connector
- Type D: micro-connector, 19 pins (similar to micro-USB)
- Type E: for automobiles

DisplayPort
- Overview
- DisplayPort Architecture
- Embedded DisplayPort (eDP)

- Developed by VESA (Video Electronics Standards Association)
- Intended to replace the DVI and VGA interfaces, and the LVDS (Low-Voltage Differential Signaling) protocol
- DisplayPort and HDMI interfaces may coexist in consumer electronics devices
- Versions of DisplayPort specifications
  - Version 1.0: published in 2006
  - Version 1.2: published in 2009
  - Version 1.3: published in 2014

- Main link (unidirectional)
  - 1, 2, or 4 lanes
  - The transmission protocol is based on micro packets carrying pixel and audio data
  - 8b/10b encoding: the clock signal is embedded into the data stream
- Auxiliary (AUX) channel (bidirectional)
  - For device control and auxiliary data
  - Default (standard) mode: Manchester encoding
  - Fast mode: 8b/10b encoding

- Allows external and internal connections
  - For internal connections of portable computers: Embedded DisplayPort (eDP)
- Copper or fiber optic cables
- DisplayPort signals are not compatible with DVI and HDMI signals
- Optional dual-mode: DVI/HDMI signals can be generated with a simple converter
  - The main link and AUX channel transmit 3 TMDS signals, a clock signal, and DDC data

- Video data
  - 18, 24, 30, 36, or 48 bits per pixel (bpp)
  - High resolutions, refresh rates, and color
    depths (up to 4096x2160, 24 bpp, 60 Hz)
- Audio data
  - 1..8 channels, uncompressed data
  - Sampling rates: 48; 96; 192 kHz
  - Sample sizes: 16 or 24 bits
  - Maximum bit rate: 6.144 Mbits/s

- Improvements in version 1.2
  - The bandwidth is doubled to 5.4 Gbits/s per lane, allowing higher resolutions
  - Multiple independent audio/video streams
    - Up to 63 A/V streams across a single connection
  - Higher speed of the auxiliary channel
    - USB peripherals, video cameras, touch panel data
  - Support for stereoscopic 3D images
  - Addition of the Mini DisplayPort connector (Apple)

- Improvements in version 1.3
  - Bandwidth increased to 8.1 Gbits/s per lane (with 4 lanes: 32.4 Gbits/s)
  - The bandwidth allows:
    - Two 4K monitors (4096x2160) at 60 Hz
    - A 4K stereo 3D display
    - A combination of 4K display and USB 3.0
    - A 5K display (5120x2880) in RGB mode
    - A UHD 8K television display (7680x4320)

- Connectors and cables
  - 20 pins for external connections
  - Powered connectors: 3.3 V, 500 mA (~1.5 W)
  - Cable length: up to 2 m for full bandwidth; 15 m for reduced bandwidth

DisplayPort Architecture

- Hot Plug Detect signal: 0 V or 3.3 V
  - Indicates the presence or absence of a monitor
  - May signal an interrupt from the monitor

- Main link (version 1.2)
  - Clock signal frequency: 162; 270; 540 MHz
  - Raw bit rates: 1.62; 2.7; 5.4 Gbits/s per lane
  - Actual bit rates: 80% of raw bit rates (with 4 lanes: 5.18; 8.64; 17.28 Gbits/s)
- No. of displays supported (24 bpp, 60 Hz)

Resolution   Resolution name   No. of displays
1280x768     WXGA              10
1680x1050    WSXGA             5
1920x1080    HDTV              4
2560x1600    WQXGA             2
4096x2160    4K x 2K           1

- Auxiliary channel
  - Default mode: 1 Mbit/s (200 kbits/s duplex)
  - Fast mode: 720 Mbits/s (200 Mbits/s duplex)
- Used by the video source (GPU) to identify the capabilities of the monitor
  - Rendering capabilities: reading the monitor EDID memory
  - Support of video content protection: HDCP (High-bandwidth Digital Content Protection) key exchanges

- Helps maintain link integrity
  - The monitor can notify the video source if data errors have occurred on the main link
- Can transport auxiliary data
  - Camera and microphone data, USB 2.0 data
- Can be used to control monitor settings and operation
  - Supports the VESA MCCS (Monitor Control Command Set) standard: commands to control the properties of monitors
  - I2C channel

[Figure: Example of application using AUX data transport]

Embedded DisplayPort (eDP)

- Interface for connecting video adapters to display panels of portable computers
  - Typically, an interface based on the LVDS electrical protocol is used (e.g., LDI)
- Based on the DisplayPort standard
  - Same electrical interface
  - Same basic digital protocol
  - Can use the same GPU video port as external DisplayPort connections

- Data transferred in main link:
  - Pixel data and timing (e.g., pixel clock)
  - Video format information (e.g., color space, bpp)
  - ECC (Error Correction Code) for video data
- Data transferred in AUX channel:
  - EDID information
  - Display control: brightness control; dynamic backlight control; frame rate control (FRC)
  - Power state control

- Advantages compared to LVDS interfaces
  - A single connector (data, control, power signals)
  - Reduced wire count, giving a simplified cable
    - Example for a resolution of 1920x1080, 24 bpp: 4 signal wires vs. 20 signal wires
    - Lower electromagnetic interference
  - Enables new display panel control capabilities
    - Reduced power consumption (e.g., display panel self-refresh feature)
    - The packet-based protocol is extensible
  - Enables highly integrated display controllers

- GPUs are used to accelerate the geometric stage and the rendering stage of 3D graphics
  - Can be dedicated or integrated in a chipset
- GPUs contain a large number of processing cores, programmable for various shading operations
  - The processing power of GPUs can also be used for applications that require vector operations
- The CUDA architecture allows direct access to the GPU resources for general-purpose computing

- The DVI interface was designed for liquid crystal displays and digital projectors
  - Uses a serial interface for uncompressed video data and the TMDS signaling protocol
- The HDMI interface is intended especially for consumer electronics devices
  - Allows video data, audio data, and control information to be sent over a single cable
  - Uses the same signaling protocol as the DVI interface, while providing a larger bandwidth

- The DisplayPort interface will replace the VGA and DVI interfaces
  - Uses a new protocol based on micro packets
  - Video and audio data are sent on a main link with up to 4 lanes
  - Control information is sent on an auxiliary channel
- The eDP interface is intended for internal display panels
  - Uses the same protocol as DisplayPort
  - Has several advantages compared to LVDS

- Types of shader units contained by GPUs
- Unified shader units
- Dedicated and integrated GPUs
- Advantages and disadvantages of GPGPU computing
- The CUDA architecture
- Thread block in the CUDA architecture
- Grid of blocks in the CUDA architecture
- The DDC channel used by the DVI interface

- The TMDS protocol and link
- Overview of the HDMI interface
- Features of HDMI version 1.4
- The CEC (Consumer Electronics Control) bus
- Overview of the DisplayPort interface
- Improvements in version 1.2 of DisplayPort
- Functions of DisplayPort auxiliary channel
- Overview of the eDP interface
- Advantages of the eDP interface
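
The link bandwidth figures quoted in the DVI, HDMI, and DisplayPort sections can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: the function names are invented for this example, and it assumes three TMDS data channels carrying 10 bits per pixel clock, and an 80% payload efficiency for 8b/10b-encoded DisplayPort lanes, as stated in those sections.

```python
def tmds_link_rate_gbps(pixel_clock_mhz: float, channels: int = 3) -> float:
    """Raw bit rate of a TMDS link: 10 bits per pixel clock on each channel."""
    return channels * 10 * pixel_clock_mhz / 1000.0


def displayport_payload_gbps(raw_gbps_per_lane: float, lanes: int = 4) -> float:
    """Usable bit rate after 8b/10b encoding (8 payload bits per 10 line bits)."""
    return raw_gbps_per_lane * lanes * 8 / 10


print(f"DVI single link at 165 MHz: {tmds_link_rate_gbps(165):.2f} Gbits/s")
print(f"HDMI 1.3 at 340 MHz:        {tmds_link_rate_gbps(340):.2f} Gbits/s")
for raw in (1.62, 2.7, 5.4):
    print(f"DisplayPort {raw} Gbits/s x 4 lanes: "
          f"{displayport_payload_gbps(raw):.2f} Gbits/s usable")
```

The results match the values given earlier: 4.95 Gbits/s for a single TMDS link at 165 MHz, 10.2 Gbits/s for HDMI 1.3 at 340 MHz, and 5.18, 8.64, and 17.28 Gbits/s for a four-lane DisplayPort main link.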