Magnus Andersson Occlusion Culling View frustum Eye Stanford Bunny in the Crytek Sponza Atrium 2 Occlusion Culling Fully occluded Stanford Bunny in the Crytek Sponza Atrium 3 Occlusion Culling Partially occluded Stanford Bunny in the Crytek Sponza Atrium 4 Hardware Fixed-function Occlusion Culling – Semi-occluded triangles can be partially culled Very late in the pipeline Game logic Upload frame data Draw call GPU side Per-tile culling granularity CPU side Handled automatically under the hood Geometry processing Z Tile Culling Pixel processing 5 Software Occlusion Culling – Cull both CPU and GPU work Short delay Game logic + CPU side Cull very early in the pipeline SW culling Upload frame data Draw call GPU side – Can be integrated with scene traversal Geometry processing Z Tile Culling Pixel processing 6 Potentially Visible Sets (PVS) Binary Space Partitioning (BSP) trees & portals Precomputed – very efficient Scene (occluders) must be static Quake II, id Software, 1997 Difficult to handle general scenes Half-Life 2, Valve Corporation, 2004 7 Potentially Visible Sets (PVS) Player Leaf boundaries Quake II, id Software, 1997 Not part of PVS Half-Life 2, Valve Corporation, 2004 8 Dynamic Occlusion Culling Increasingly popular Modern games have more complex and dynamic worlds No complex pre-computation [HA15] Assassin’s Creed Unity, Ubisoft, 2014 – Simpler content pipeline [Col11] Battlefield 4, EA DICE, 2013 9 Z-buffer Based Culling Complex object Hierarchical Z Buffer (HiZ) [Greene93] Rasterize to full resolution z buffer Create HiZ buffer – Find the maximum depth in each NxN tile Bounding shape Full resolution depth buffer Perform occlusion query with HiZ buffer General algorithm works for both SW and HW occlusion culling HiZ buffer Dragon model courtesy of Stanford University Computer Graphics Laboratory 10 Intel Software Occlusion Culling Framework [CMK16] Algorithm phases: 1. Rasterize a few designated occluder objects to z buffer – Heavily SSE/AVX optimized – Parallel triangle setup – Parallel pixel depth computation 2. Compute 1-level HiZ buffer (and throw away z buffer) 3. Perform queries and render surviving objects 11 Z-buffer Based Culling Rendering to z-buffer per pixel Updating HiZ tile needs all pixels within the tile Occlusion Query per tile Full resolution depth buffer Wouldn’t it be nice to compute HiZ directly? – Being conservative is the only requirement Idea: use alternative HiZ representation HiZ buffer 12 Alternative HiZ buffer representation Masked Occlusion Culling for Graphics Hardware [AHAM15] Two depth values per tile Per-pixel selection mask 0 0 0 0 0 0 0 0 0 zmax 1 zmax 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Layer selection mask 13 Masked Occlusion Culling [AHAM15] 14 Masked Occlusion Culling [AHAM15] 15 Masked Occlusion Culling [AHAM15] 16 Masked Occlusion Culling [AHAM15] 17 Masked Occlusion Culling [AHAM15] 18 Masked Occlusion Culling [AHAM15] 19 Masked Occlusion Culling [AHAM15] ? Merge 20 Masked Occlusion Culling [AHAM15] 21 Masked Occlusion Culling [AHAM15] Not culled Culled 22 Masked Occlusion Culling [AHAM15] Triangle meshes 23 Masked Occlusion Culling [AHAM15] Originally designed for graphics hardware Directly update HiZ buffer without computing a full res z buffer Decouples coverage sampling (rasterization) and depth computation Depth buffer Approximate, conservative HiZ buffer 24 Masked Software Occlusion Culling Could Masked Occlusion Culling [AHAM15] be really fast for software occlusion culling? Much less memory to read/write than full res z-buffer Updates use bitmasks – can process many pixels in parallel (i.e. SSE/AVX) No need to compute per-pixel depths – Would need a fast SW rasterizer to compute coverage Turns out it can Paper presented at High Performance Graphics this year [HAAM16] Source code available! 25 Single Instruction, Multiple Data (SIMD) x86 AVX 256 bits 32 bits A 3 A + B 5 8 32 bits 32 bits 32 bits 32 bits 32 bits 32 bits 32 bits 32 bits 3 4 5 1 6 4 2 10 + + + + + + + + B 5 5 7 11 3 4 5 5 8 9 12 12 9 8 7 15 26 Single Instruction, Multiple Data (SIMD) A x86 AVX 32 bits 256 bits 0xAC1DBA5E A & B 0x51CAFE37 0x0008BA16 0xAC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E51CAFE3751CAFE3751CAFE3751CAFE37 & B 0x51CAFE3751CAFE3751CAFE3751CAFE37AC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E 0x0008BA160008BA160008BA160008BA160008BA160008BA160008BA160008BA16 27 An abridged history of Intel’s SIMD instruction sets SSE2, 2001 AVX, 2011 256b wide AVX-512, 2016 512b wide 2nd Gen Intel® Core™ Processors 1998 2017 SSE, 1999 128b wide SSE4, 2006 AVX2, 2013 Intel® microarchitecture code name Nehalem 4th Gen Intel® Core™ Processors New algorithm Supported Easily extended in ourto library AVX-512 code target architecture 28 Masked software occlusion culling Algorithm Overview Triangle setup Transform Clip Compute bounds Depth plane Traversal setup Tile traversal 8-wide triangle setup 8 scanlines Compute coverage Depth test 256 pixels (8 tiles with 8x4 pixels) Update 30 Transform Transform and Clip Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 31 Transform Compute Bounding Box Clip Compute bounds Depth plane Padded to 32x8 pixel supertiles Traversal setup Compute coverage Depth test Update 32 Transform Compute Depth Plane Clip Compute bounds Depth = ax + by + c Depth plane – Conservative tile depth: Check sign of a and b +a Compute coverage Depth test – Can be incrementally updated +b Traversal setup Update -, + +, + -, - +, Clamp to vertex depths 33 Transform Supertile Traversal Order Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 34 Transform AVX Register Layout Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 35 Transform AVX Register Layout Clip Compute bounds Depth plane One scanline per SIMD lane Traversal setup Compute coverage Depth test Update 36 Transform Edge Slopes Clip Compute bounds Depth plane Compute slopes (∆y/∆x) once – Similar to regular scanline rasterizers Traversal setup Compute coverage Depth test Update 37 Transform Compute Intersections Clip Compute bounds Depth plane Compute intersections for each scanline Traversal setup Compute coverage Depth test – Eight scanlines in parallel using AVX Update Intersections 38 Transform Compute Coverage Mask Clip Compute bounds Depth plane Start with full coverage mask Traversal setup Compute coverage Depth test Update Intersections 39 Transform Compute Coverage Mask Clip Compute bounds Depth plane Start with full coverage mask Traversal setup – Shift each lane (scanline) to intersection – AVX2 and later have per-lane shift instruction Compute coverage Depth test Update >> >> >> >> >> >> >> >> Intersections 40 Transform Compute Coverage Mask Clip Compute bounds Repeat the same process for the next edge Depth plane Traversal setup Compute coverage Depth test Update Right edge Left edge Right edge Intersections 41 Transform Compute Coverage Mask Clip Compute bounds Repeat the same process for the next edge Depth plane Traversal setup Compute coverage – Edge is facing right invert mask Depth test Update Intersections 42 Transform Compute Coverage Mask Combine masks of all overlapping edges Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 43 Transform Compute Coverage Mask Combine masks of all overlapping edges – Using bitwise AND Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 44 Transform Compute Coverage Mask Combine masks of all overlapping edges – Using bitwise AND Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 45 Transform Shuffle Mask Shuffle mask to form better shaped tiles – Before: each SIMD lane is a scanline Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 46 Transform Shuffle Mask Shuffle mask to form better shaped tiles – Before: each SIMD lane is a scanline – After: each SIMD lane is a 8x4 tile Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update 47 Transform Depth Test Interpolate conservative depth (per 8x4 tile) Test against buffer Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update Buffer 48 Transform Update Tile Clip Compute bounds Depth plane Two code paths (can be switched compile time) – Original update method [AHAM15] Traversal setup Compute coverage Depth test Update – New update method tailored for SW [HAAM16] Why use a new update method? – Faster – same culling power – Less accurate than original, more dependent on render order – Works best if you render front-to-back 49 Transform Update Tile, New Method [HAAM16] 0 zmax is the reference layer – Maximum value for the entire tile 1 zmax is the working layer Clip Compute bounds Depth plane Traversal setup Compute coverage Depth test Update – Maximum value for a subset of the tile – Updated as 1 , z tri ) – New depth = max(zmax max – New mask = TriangleMask OR LayerMask Whenever working layer mask is full, overwrite reference layer 50 Update Tile 51 Update Tile 52 Update Tile 53 Update Tile 54 Update Tile 55 Update Tile 1 tri 0 1 Discard heuristic: If zmax – zmax > zmax – zmax , discard working layer Restart 56 Update Tile 57 Update Tile 58 Update Tile Full overwrite: Restart from new value 59 Transform Update Tile Clip Compute bounds Depth plane Update is quicker than original [AHAM15] Traversal setup Compute coverage Test is also quicker – Need only to test against reference layer Depth test 0 ) (zmax Update 60 Results Intel Occlusion Culling Sample (μs) Old [CMK16] New [HAAM16] 16x 3.7x Clear: Clearing the depth buffer Geom: Transform & project geometry Rast: Triangle setup & occluder rasterization Gen: Compute HiZ buffer from full resolution z buffer Test: Perform occlusion queries 62 Results Performance comparison for camera animation First frame Frustum only Old New Last frame 63 Code is available as open-source Masked Occlusion Culling API void SetResolution(); void SetNearClipPlane(); Setup void ClearBuffer(); static void TransformVertices(); Result RenderTriangles(); Result TestTriangles(); Render & query Result TestRect(); void ComputePixelDepthBuffer(); OcclusionCullingStatistics GetStatistics(); Debug 65 Masked Occlusion Culling API Result RenderTriangles( float *inVtx, // Clip space vertex positions uint *inTris, // Index array (Indices to inVtx buffer) int nTris, // Triangle count (the number of index triplets in inTris) ClipPlanes mask, // Mask for potential frustum bound overlap ScissorRect *scissor, // Scissor region VertexLayout &layout // Vertex format of inTris. There is a fast-path for AoS with (x, y, z, w) coordinates ); Render to the software HiZ buffer 66 Masked Occlusion Culling API mask = 0 Result RenderTriangles( float *inVtx, uint *inTris, int nTris, ClipPlanes mask, mask = leftPlane | nearPlane View frustum ScissorRect *scissor, VertexLayout &layout Near plane ); Clipping is not free... Eye – If you’re already doing frustum culling, let the API know the outcome 67 Masked Occlusion Culling API Result RenderTriangles( float *inVtx, Scissor region (screen space AABB) uint *inTris, int nTris, ClipPlanes mask, View frustum ScissorRect *scissor, VertexLayout &layout ); Can be used for threading Eye – One scissor region per thread 68 Masked Occlusion Culling API Result TestTriangles( // Returns the collective culling outcome of the triangles float *inVtx, // Clip space vertex positions uint *inTris, // Index array (Indices to inVtx buffer) int nTris, // Triangle count (the number of index triplets in inTris) ClipPlanes mask, // Mask for potential frustum bound overlap ScissorRect *scissor, // Scissor region VertexLayout &layout // Vertex format of inTris. There is a fast-path for AoS with (x, y, z, w) coordinates ); Test triangles against the software HiZ buffer – Does not update the buffer 69 Masked Occlusion Culling API Result TestRect( float xmin, float ymin, // Returns the culling outcome of the screen space rectangle /* Screen space bounds: float xmax, [xmin, ymin] – [xmax, ymax] float ymax, */ float wmin // Conservative clip space w (typically the w-component of the nearest bbox vertex in clip space) ); Test rectangle against the software HiZ buffer – Does not update the buffer 70 Example use case: Scene Bounding Volume Hierarchy (BVH) traversal and culling RenderFrame Culled! ClearBuffer(); prioQueue.push(root); while (!prioQueue.empty()) { Node node = prioQueue.pop(); if (FrustumTest(node) == Culled) continue; compute_screen_space_bounds(node); if (TestRect(bounds) == Culled) continue; if (node is InnerNode) { prioQueue.push(node.left, dist); prioQueue.push(node.right, dist); } else (node is Leaf) { TransformVertices(leaf.vertices); RenderTriangles(xfVertices); send_leaf_to_GPU(); } } 71 Essential Tools We Have Relied On Intel® VTune™ – https://software.intel.com/en-us/intel-vtune-amplifier-xe SSE/AVX intrinsics guide – https://software.intel.com/sites/landingpage/IntrinsicsGuide/ 72 References [AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9 [CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/ software-occlusion-culling, (2013–2016) [Col11] COLLIN D.: Culling the Battlefield. Game Developer’s Conference (presentation), (2011) [Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238 [HA15] HAAR U., AALTONEN S.: GPU-Driven Rendering Pipelines. SIGGRAPH Advances in Real-Time Rendering in Games course, (2015) [HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016) 73 Check it out! GitHub: Lightweight library – https://github.com/GameTechDev/MaskedOcclusionCulling GitHub: Example integrated in Intel’s Software Occlusion Culling demo – https://github.com/GameTechDev/OcclusionCulling Project page: Masked Software Occlusion Culling – https://software.intel.com/en-us/articles/masked-software-occlusion-culling Questions and feedback welcome – magnus.andersson@intel.com 74 Legal Notices and Disclaimers Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2016 Intel Corporation. Intel, the Intel logo, VTune and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.