3D Optimization for Intel® Graphics, Gen9
Andrew Lauritzen - Graphics Software Engineer, Intel Corporation
Michael Apodaca - Principal Engineer, Intel Corporation
GVCS004

Agenda
• Graphics API Overhead
• Commands and State
• Resource Binding
• New GPU Features
• Summary and Q&A

Graphics API Overhead
• Most GPU vendors have complex drivers
  - Do lots of fancy optimizations on the fly
  - Costs CPU cycles, but makes the GPU run faster
• Drivers spawn threads that compete with the application
  - Driver thread often consumes an entire core by itself
  - Plus another core for the game submission thread
  - Minimal multithreading beyond these two threads

CPU/GPU Optimization Trade-off
• Has recently become far more serious with SoCs
  - Even if not “CPU bound”, CPU/GPU share power
  - More CPU load => less GPU power/performance
• Complex CPU optimizations are not a good idea…
  - Tax on all applications, even well-optimized ones
  - CPU work can take more power than it saves on the GPU!
  - Leads to lower overall performance

Thinner Intel® Graphics Driver
• To address this, Intel wrote a much thinner DirectX* 11 driver
• Big benefit to well-written applications
  - But does far less work to make poor ones run well
  - e.g., no redundant state elimination, minimal state-based shader compiles, etc.
• Still unavoidable CPU overhead due to API design
  - New APIs (DirectX 12, Vulkan, Metal) address this

High Efficiency Graphics APIs
• Large increase in power efficiency
  - Power saved on the CPU can be given to the GPU
  - Applications can both run faster and use less power
• Additional GPU optimization opportunities
  - i.e., optimizations that we had to drop in the thinner driver
  - Pipeline state objects give the driver more context
• This presentation will primarily use examples from DirectX* 12
  - In most cases similar concepts apply to Vulkan and Metal

Power and Performance
[Figure: two charts comparing DirectX* 11 and DirectX 12, titled “Less power @ same performance” and “More performance @ same power”]
DirectX 12 can significantly reduce CPU power or improve performance
Source: https://software.intel.com/en-us/blogs/2014/08/11/siggraph-2014-directx-12-on-intel

Commands and State

State in DirectX* 11
• DirectX* 11 context is “stateful”
  - State grouped into moderately sized chunks (Immutable State Objects)
  - e.g., RasterState, DepthStencilState, BlendState, etc.
• Groupings do not always map perfectly to hardware
  - e.g., D3D BlendState != GPU BlendState
  - Also, driver must still validate state combinations and may perform GPU optimizations by combining multiple D3D state objects at Draw time

State in DirectX* 12
[Diagram: the separate state blocks (Input Assembler, Vertex Shader, Hull Shader, Domain Shader, Geometry Shader, Raster State, Pixel Shader, Blend State, Depth Stencil State) shown grouped inside a single Pipeline State]
• Immutable, monolithic Pipeline State Objects (PSOs)
  - Single object captures as much state as possible
  - Much lower chance of missing driver context
  - Allows state validation and optimizations to occur at load time
  - Allows more advanced link-time optimizations on shaders

Pipeline State Objects (PSO)
• Create PSOs at initialization time
  - Multithread your initialization/PSO creation code!
  - Use PSO “libraries”/caching
    Compress (LZMA or similar) many PSOs together before storing
    Often a lot of redundancy that can be compressed out
• Drivers will optimize PSO transitions
  - e.g., culling redundant GPU states
  - Doesn’t eliminate the need to sort to reduce state changes

DirectX* 11 Deferred Contexts
• Limited parallelism with a single context
• Deferred contexts do not address the problems:
  - CPU performance and cache issues with transient objects
  - State mismatch and lazy state setting
  - Inherited internal states
  - MAP_DISCARD renaming, hazard tracking, etc.
  - Non-trivial patching happens at submission time
• Result: more overhead and limited parallelism

DirectX* 12 Direct Command Lists
• Transient / Single-Use Command Lists
• No state inheritance between Direct Command Lists
  - No API state or internal state inheritance (renaming, etc.)
• Resource management
  - No implicit resource renaming
  - Explicit barriers to handle hazards and resource transitions

DirectX* 12 Bundles
[Diagram labels: CommandList, Bundle, Bundle, CommandList]
• Persistent / Re-executable Command Lists
• Some minimal state inheritance is allowed
  - Some driver patching may occur at Execute time
  - If you don’t need to inherit something, set it (again) in the bundle
• Note: Overhead is already very low in DirectX* 12
  - Usually need ~10+ draws to make bundles a win
  - Don’t add any GPU overhead/indirections to enable bundles!

DirectX* 12 Command List Allocators
• Container for GPU command buffer (and related) allocations
• Complexity:
  - Can only be accessed by a single thread; use ~1 allocator per thread
  - Reset is optimized for reuse with similarly sized Command Lists; an allocator should therefore stay associated with the same/similar Command Lists
  - Application doesn’t know the actual memory size (memory size ∝ number of draw calls)
• Recommendation:
  - Use a single allocator for multiple ‘small’ Command Lists
  - Use a single allocator for a single ‘large’ Command List
  - Reuse accordingly

DirectX* 12 Command Queues
• Exposes GPU Parallelism
  - Multi-Engine and Multi-GPU
• 3D and Compute are not simultaneous
  - Queues may pre-empt each other
• Copy might be simultaneous with 3D and Compute
  - Driver will make decision based on GPU performance and capabilities

DirectX* 12 Execute Indirect
• Expands on Indirect Draw/Dispatch
• Command signature defines indirect argument buffer format
• Indirect arguments can contain
  - Draw/dispatch parameters
  - Resource bindings
• Optional count buffer

Execute Indirect on Gen9
[Diagram: a command sequence (IB, VB, Draw, UAV, CBV, Draw) next to argument buffer entries (IB Args, VB Args, Draw Args, UAV Args, CBV Args, Draw Args), shown twice]
• Must be executed on GPU
• Internal Compute Shader patches Command List buffer
• Optimize transition barriers on Argument / Count buffers
• DirectX* 11 DrawIndirect:
  - Count = 1
  - No resource bindings

Resource Binding

Resource Descriptors
• Resource views are effectively just a small structure
  - Metadata and a pointer to memory (usually ~32-64 bytes)
  - e.g., texture dimensions, format, layout, etc.
• Direct3D* 12 directly exposes these “descriptors”
  - Independent from the actual memory they reference
  - Can be created/copied/etc. freely
  - Application must ensure no dangling pointers

Resource Descriptors (continued)
• Not an API object; manipulated directly by application
  - Descriptor size query-able by application
  - Can be created at any time; free-threaded API call
• Descriptors are put into “heaps” (arrays)
  - Constant buffer (CBV), shader resource (SRV) and unordered access (UAV) views can be mixed in one heap
  - Samplers in a separate heap
• Changing heaps is expensive (pipeline flush)
  - Ideally use a single heap of each type (sampler, CBV/SRV/UAV)

Descriptors Example
[Diagram: Descriptor Heap with slots UAV, CBV, ..., SRV, CBV, SRV, SRV]
    // uavHandle and cbvHandle are CPU descriptor handles of the destination heap slots
    D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = { ... };
    device->CreateUnorderedAccessView(res, nullptr, &uavDesc, uavHandle);  // nullptr: no counter resource
    D3D12_CONSTANT_BUFFER_VIEW_DESC cbvDesc = { ... };
    device->CreateConstantBufferView(&cbvDesc, cbvHandle);  // cbvDesc.BufferLocation names the buffer

Root Signature
• Think of it like a function signature for your shader(s)
• Defines parameters and how they map to shader inputs
  - Root constants (data: zero indirections)
  - Root descriptors (pointer to data: one indirection)
  - Descriptor tables (pointer to descriptors: two indirections)
• Each parameter can be visible to one or more shader stages
• Parameters are “versioned” by implementation/hardware
  - This is the single place where the “stream” of versions is managed
  - Maximum size is very small (64 DWORDs) to avoid abuse

Root Parameter Indirections
[Diagram: Root Signature with 0 Root Constants, 1 Root Descriptor, 2 Descriptor Table; Memory; Descriptor Heap (… UAV CBV …)]

Root Constants
• Pass a small number of constants directly to shaders
  - Bound to shader as a single constant buffer
• Useful for simple indirections: draw ID, material ID, etc.
  - Avoids creating versioned memory, descriptor, heap, etc.
  - Shader can use them to index into arbitrary data structures

Root Descriptors
• Stores a single descriptor directly as a root parameter
  - No need to burn through descriptor heap space
  - Most useful for a descriptor that changes ~ every draw
• Can only reference “raw data”
  - Only buffer resources (CBVs, SRVs/UAVs of buffers)
  - No type conversions (i.e., only float/uint/sint components)
  - No out of bounds checking
  - i.e., it’s just a pointer to memory

Descriptor Tables
• Maps a contiguous range of descriptors to shader slots
  - Can mix SRVs, UAVs, and CBVs arbitrarily
  - Multiple descriptor tables can point to disjoint ranges
• Example: Use separate parameters for different update frequencies
  - Per-scene, per-material, per-instance, per-draw, etc.
  - Same idea as splitting constant buffers by update frequency, now applied to descriptors too

Root Signature Example
[Diagram: parameter 0 = Descriptor Table (t1, b1, t4, t5); parameter 1 = Descriptor Table (u0, b2)]
    // CD3DX12_* helper types come from d3dx12.h
    CD3DX12_DESCRIPTOR_RANGE Param0Ranges[3];
    Param0Ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1, 1); // t1
    Param0Ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 1); // b1
    Param0Ranges[2].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 2, 4); // t4-t5
    CD3DX12_DESCRIPTOR_RANGE Param1Ranges[2];
    Param1Ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0); // u0
    Param1Ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 2); // b2
    // Visibility to all stages allows sharing binding tables
    CD3DX12_ROOT_PARAMETER Param[2];
    Param[0].InitAsDescriptorTable(3, Param0Ranges, D3D12_SHADER_VISIBILITY_ALL);
    Param[1].InitAsDescriptorTable(2, Param1Ranges, D3D12_SHADER_VISIBILITY_ALL);

Root Signature Example (continued)
[Diagram: parameter 0 = Descriptor Table (t1, b1, t4, t5); parameter 1 = Descriptor Table (u0, b2); parameter 2 = Shader Resource View (t0); parameter 3 = uint4 Constant (b0)]
    ...
    // Param must be declared with 4 entries for this version
    Param[2].InitAsShaderResourceView(0);  // t0
    Param[3].InitAsConstants(4, 0);        // b0 (4x32-bit constants)

Root Signature Example (HLSL)
[Diagram: same four-parameter root signature as above]
    DescriptorTable(SRV(t1), CBV(b1), SRV(t4, numDescriptors=2)),
    DescriptorTable(UAV(u0), CBV(b2)),
    SRV(t0),
    RootConstants(num32BitConstants=4, b0)

Binding Example
[Diagram: Descriptor Heap (UAV, CBV, SRV, CBV, SRV) bound through the four-parameter root signature; parameter 2 holds the SRV and parameter 3 holds the constants 1, 3, 3, 7]
    cmdList->SetGraphicsRootDescriptorTable(0, srvGPUHandle);
    cmdList->SetGraphicsRootDescriptorTable(1, uavGPUHandle);
    cmdList->SetGraphicsRootShaderResourceView(2, srvGPUAddress);  // root SRV takes a GPU virtual address
    const UINT constants[4] = {1, 3, 3, 7};
    cmdList->SetGraphicsRoot32BitConstants(3, 4, constants, 0);

Resource Binding
• Root parameters map very directly to GPU implementation
  - Each draw/dispatch gets a versioned copy of the root parameters
  - Each root parameter gets passed to the shader register file via “push constants”
  - When an execution unit (EU) thread launches, values are immediately available
• Performance is very predictable
  - If you change any root parameter for a given shader stage, there’s a small cost, i.e., a new version of the draw constants is created
  - Beyond that, low sensitivity to #/type of root parameters or # changed
  - Unlike previous architectures, no cost for “unused” descriptors, etc.

Descriptor Tables
• All descriptors are referenced via the “bindless resources” path
  - i.e., the entire heap is indexed directly by the shader
• Setting a descriptor table is syntactic sugar
  - Driver passes the offset in the heap as a constant (DWORD)
  - Shader adds the offset to the binding index in the shader
• Recommend applications switch to bindless style
  - Use a single descriptor table that maps the entire heap
  - Store binding indices with other material constants (cbuffers, root constants)
  - Enables new capabilities: all resources are accessible without CPU intervention

Binding Example: Draw Constants
[Diagram: the same binding calls as in the previous Binding Example, with the resulting draw constants per root parameter]
  Root Signature               Draw Constants
  0 Descriptor Table           5
  1 Descriptor Table           1
  2 Shader Resource View       [64-bit SRV address]
  3 uint4 Constant             1, 3, 3, 7

Static Samplers
• Define sampler parameters right in the root signature
  - Or right in the shader with HLSL root signature language
• If only static samplers are used then it isn’t necessary to manage a sampler heap anymore!
• No GPU performance disadvantage
  - Driver places static samplers in the regular sampler heap
  - Same as application putting them there (doesn’t count against DirectX* 12 limits)

New GPU Features

Conservative Rasterization
• New hardware rasterizer mode
  - Includes any pixels that are at least partially covered (outer conservative)
  - New pixel shader input indicates fully covered pixels (inner conservative)
• Truly conservative implementation
  - No pixels missed or incorrectly flagged as fully covered
  - Degenerate triangles are not culled
  - Accounts for fixed point snapping error
• Many use cases in the literature
  - Voxelization, uniform grids, binning, tiled rendering, anti-aliasing, shadows, occlusion culling, texture atlases and light maps, …

Raster Ordered Views
• Also known as pixel synchronization [1]
  - Introduced in Intel® Graphics Technology, Gen7.5 (4th generation processors)
  - Now standardized in DirectX* 11.3 and 12
• Enables ordered per-pixel read/write memory accesses
  - Efficiently construct deterministic (render order) per-pixel data structures
• Many applications, and fits well with conservative rasterization
  - Order independent transparency, volumetric shadow maps, K-buffers, voxelization, anti-aliasing, etc.
[1] http://advances.realtimerendering.com/s2013/2013-07-23-SIGGRAPH-PixelSync.pdf

Render Target Read
• Shader read/modify/write of the bound render targets
  - Similar to raster ordered views, but through regular raster paths
  - i.e., includes format conversions, hits the raster caches, etc.
• Particularly suitable for programmable blending
  - Example: decals with deferred shading
  - No need to move to general memory accesses and raster ordered views
  - Format conversions are automatic and efficient

Tiled Resources
• Sparse 2D and 3D textures
  - Mega-textures
  - Sparse shadow maps
  - Sparse voxel octrees (coarse-grained)
• Sparse buffers
  - Geometry streaming
• Efficiently add/remove high detail mips
  - Respond to memory pressure

Standard Swizzle
• Standard pixel swizzle pattern inside a 64KB tile
• Enables directly mapping textures in hardware format
  - Stream textures from disk with no format conversions
  - Poke portions of textures directly from CPU
• Efficient multi-adapter sharing
  - Shared textures between GPUs from different vendors
  - … without several extra copies to go through linear/row-major intermediary

Other New GPU Features
• Adaptive scalable texture compression (ASTC)
• 16x multi-sample anti-aliasing (MSAA)
• Post depth test coverage mask
• Floating point atomics (min/max/cmpexch)
• Min/max texture filtering
• Multi-plane overlays

Summary and Q&A

Summary and Next Steps
• High efficiency graphics APIs are a great fit for Intel hardware
  - Increased performance and predictability
  - Increased power efficiency
• DirectX* 12 is already supported today
  - Many new GPU features are exposed

Additional Sources of Information
• A PDF of this presentation is available from our Technical Session Catalog: www.intel.com/idfsessionsSF
• Follow @IntelSoftware and @DirectX12
• https://software.intel.com/en-us/gamedev
• http://blogs.msdn.com/directx
• Working on DirectX* 12 on Intel?
  - andrew.t.lauritzen@intel.com, @AndrewLauritzen
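
Example: descriptor table indexing (illustrative sketch)
The descriptor-table mechanics from the Resource Binding section (the driver passes a heap offset as one DWORD, the shader adds its binding index to it) can be modelled on the CPU roughly as below. This is a minimal sketch, not Direct3D 12 code and not the driver's implementation: `DescriptorHeapModel`, `DescriptorTableModel`, `Resolve`, `MaterialConstants`, `MakeExampleHeap` and all resource names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical CPU-side model; none of these types are part of Direct3D 12.
enum class ViewKind { CBV, SRV, UAV };

struct Descriptor {
    ViewKind kind;
    std::string debugName;  // stands in for the real metadata + memory pointer
};

// A descriptor heap is modelled as a plain array of descriptors.
struct DescriptorHeapModel {
    std::vector<Descriptor> slots;
};

// A descriptor table reduces to one DWORD: the heap offset of its first descriptor.
struct DescriptorTableModel {
    std::uint32_t heapOffset;
};

// What the shader effectively does: table offset + binding index selects a heap slot.
const Descriptor& Resolve(const DescriptorHeapModel& heap,
                          DescriptorTableModel table,
                          std::uint32_t bindingIndex) {
    const std::uint32_t heapIndex = table.heapOffset + bindingIndex;
    assert(heapIndex < heap.slots.size());  // bounds check exists only in this model
    return heap.slots[heapIndex];
}

// Bindless style: absolute heap indices travel with the other material constants.
struct MaterialConstants {
    std::uint32_t albedoIndex;
    std::uint32_t normalIndex;
};

// Builds a small heap that mixes CBV, UAV and SRV entries (assumed layout).
DescriptorHeapModel MakeExampleHeap() {
    DescriptorHeapModel heap;
    heap.slots.push_back({ViewKind::CBV, "perFrameConstants"});  // slot 0
    heap.slots.push_back({ViewKind::UAV, "lightGrid"});          // slot 1
    heap.slots.push_back({ViewKind::SRV, "brickAlbedo"});        // slot 2
    heap.slots.push_back({ViewKind::SRV, "brickNormal"});        // slot 3
    heap.slots.push_back({ViewKind::SRV, "metalAlbedo"});        // slot 4
    heap.slots.push_back({ViewKind::SRV, "metalNormal"});        // slot 5
    return heap;
}
```

In this model a per-material table is just `DescriptorTableModel{4}`, and binding index 0 then resolves to heap slot 4. The bindless style recommended above corresponds to a single `DescriptorTableModel{0}` spanning the whole heap, with `MaterialConstants` carrying the absolute slot numbers, so switching materials changes constants rather than tables.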

Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.