Masked Software Occlusion Culling

Magnus Andersson
Occlusion Culling
[Figure: the eye and its view frustum. Stanford Bunny in the Crytek Sponza Atrium]
2
Occlusion Culling
• Fully occluded
3
Occlusion Culling
• Partially occluded
4
Hardware Fixed-function Occlusion Culling
• Very late in the pipeline
• Per-tile culling granularity
– Semi-occluded triangles can be partially culled
• Handled automatically under the hood
[Figure: pipeline. CPU side: Game logic, Upload frame data, Draw call. GPU side: Geometry processing, Z Tile Culling, Pixel processing]
5
Software Occlusion Culling
• Cull very early in the pipeline
– Cull both CPU and GPU work
– Can be integrated with scene traversal
• Short delay
[Figure: pipeline. CPU side: Game logic + SW culling, Upload frame data, Draw call. GPU side: Geometry processing, Z Tile Culling, Pixel processing]
6
Potentially Visible Sets (PVS)
• Binary Space Partitioning (BSP) trees & portals
• Precomputed – very efficient
• Scene (occluders) must be static
• Difficult to handle general scenes
[Images: Quake II, id Software, 1997; Half-Life 2, Valve Corporation, 2004]
7
Potentially Visible Sets (PVS)
[Figure labels: Player; Leaf boundaries; Not part of PVS. Images: Quake II, id Software, 1997; Half-Life 2, Valve Corporation, 2004]
8
Dynamic Occlusion Culling
• Increasingly popular
• Modern games have more complex and dynamic worlds
• No complex pre-computation
– Simpler content pipeline
[Images: Assassin’s Creed Unity, Ubisoft, 2014 [HA15]; Battlefield 4, EA DICE, 2013 [Col11]]
9
Z-buffer Based Culling
Hierarchical Z Buffer (HiZ) [Greene93]
• Rasterize to full resolution z buffer
• Create HiZ buffer
– Find the maximum depth in each NxN tile
• Perform occlusion query with HiZ buffer
• General algorithm works for both SW and HW occlusion culling
[Figure: complex object and its bounding shape; full resolution depth buffer; HiZ buffer. Dragon model courtesy of Stanford University Computer Graphics Laboratory]
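The HiZ steps above can be sketched in plain C++. This is a minimal illustration, not code from any of the cited frameworks: the `HiZBuffer` type and both function names are made up for the example, larger depth is assumed to mean farther away, depths are assumed to be non-negative, and the buffer size is assumed to be a multiple of the tile size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper type for illustration; larger depth means farther away.
struct HiZBuffer {
    int tilesX = 0, tilesY = 0, tileSize = 0;
    std::vector<float> zMax;  // one maximum depth per tile
};

// Reduce a full resolution depth buffer to one maximum depth per NxN tile.
// Assumes width and height are multiples of tileSize and depths are >= 0.
HiZBuffer buildHiZ(const std::vector<float>& depth, int width, int height, int tileSize) {
    HiZBuffer hiz;
    hiz.tileSize = tileSize;
    hiz.tilesX = width / tileSize;
    hiz.tilesY = height / tileSize;
    hiz.zMax.assign(static_cast<std::size_t>(hiz.tilesX) * hiz.tilesY, 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& tile = hiz.zMax[(y / tileSize) * hiz.tilesX + (x / tileSize)];
            tile = std::max(tile, depth[y * width + x]);
        }
    }
    return hiz;
}

// Occlusion query for the pixel rectangle [x0, x1] x [y0, y1] (inclusive) whose
// nearest depth is zMin. The rectangle is reported hidden only if it lies behind
// the farthest occluder depth of every tile it overlaps.
bool isOccluded(const HiZBuffer& hiz, int x0, int y0, int x1, int y1, float zMin) {
    for (int ty = y0 / hiz.tileSize; ty <= y1 / hiz.tileSize; ++ty)
        for (int tx = x0 / hiz.tileSize; tx <= x1 / hiz.tileSize; ++tx)
            if (zMin <= hiz.zMax[ty * hiz.tilesX + tx])
                return false;  // may be visible in this tile
    return true;
}
```

A single far pixel in a tile raises that tile's maximum, which is why the query stays conservative: it can miss a culling opportunity but should not hide a visible object.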
10
Intel Software Occlusion Culling Framework [CMK16]
Algorithm phases:
1. Rasterize a few designated occluder objects to z buffer
– Heavily SSE/AVX optimized
– Parallel triangle setup
– Parallel pixel depth computation
2. Compute 1-level HiZ buffer (and throw away z buffer)
3. Perform queries and render surviving objects
11
Z-buffer Based Culling
• Rendering to z-buffer per pixel
• Updating HiZ tile needs all pixels within the tile
• Occlusion Query per tile
• Wouldn’t it be nice to compute HiZ directly?
– Being conservative is the only requirement
• Idea: use alternative HiZ representation
[Figure: full resolution depth buffer; HiZ buffer]
12
Alternative HiZ buffer representation
Masked Occlusion Culling for Graphics Hardware [AHAM15]
• Two depth values per tile
• Per-pixel selection mask
[Figure: one tile stored as two depth values, $z_{max}^{0}$ and $z_{max}^{1}$, plus a layer selection mask holding a 0 or a 1 for each pixel]
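As a rough sketch of this representation, a tile can be modelled as two floats and one bitmask. The struct name, the 8x4 tile size (borrowed from the software version described later in the deck) and the bit ordering are assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical layout: an 8x4 pixel tile (32 pixels) stored as two depth
// values and one layer selection bit per pixel.
struct MaskedTile {
    float zMax[2];       // zMax[0] and zMax[1]: the two depth layers
    std::uint32_t mask;  // bit (y * 8 + x) selects the layer for pixel (x, y)
};

// Conservative (farthest possible) depth of one pixel in the tile.
float pixelZMax(const MaskedTile& tile, int x, int y) {
    const int bit = y * 8 + x;
    return tile.zMax[(tile.mask >> bit) & 1u];
}
```

On typical platforms this tile takes 12 bytes, compared with 128 bytes for 32 per-pixel float depths.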
13
Masked Occlusion Culling [AHAM15]
[Figure sequence, slides 14 to 23. Surviving labels: "Merge", "Not culled", "Culled", "Triangle meshes"]
Masked Occlusion Culling [AHAM15]
• Originally designed for graphics hardware
• Directly update HiZ buffer without computing a full res z buffer
• Decouples coverage sampling (rasterization) and depth computation
[Figure: depth buffer; approximate, conservative HiZ buffer]
24
Masked Software Occlusion Culling
Could Masked Occlusion Culling [AHAM15] be really fast for software occlusion culling?
• Much less memory to read/write than full res z-buffer
• Updates use bitmasks – can process many pixels in parallel (using SSE/AVX)
• No need to compute per-pixel depths
– Would need a fast SW rasterizer to compute coverage
Turns out it can!
• Paper presented at High Performance Graphics this year [HAAM16]
• Source code available!
25
Single Instruction, Multiple Data (SIMD)
x86 (one 32-bit value): A + B = 3 + 5 = 8
AVX (256 bits, eight 32-bit lanes), one add per lane:
A     =  3   4   5   1   6   4   2  10
B     =  5   5   7  11   3   4   5   5
A + B =  8   9  12  12   9   8   7  15
26
Single Instruction, Multiple Data (SIMD)
x86 (32 bits): A & B = 0xAC1DBA5E & 0x51CAFE37 = 0x0008BA16
AVX (256 bits):
A     = 0xAC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E51CAFE3751CAFE3751CAFE3751CAFE37
B     = 0x51CAFE3751CAFE3751CAFE3751CAFE37AC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E
A & B = 0x0008BA160008BA160008BA160008BA160008BA160008BA160008BA160008BA16
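The two SIMD slides can be mimicked with a scalar model of an eight-lane register. This is only an illustration of the lane-wise semantics: the `Lanes8` alias and both helpers are invented here, and the code uses plain loops rather than real vector instructions. With AVX2 intrinsics the same operations would map to `_mm256_add_epi32` and `_mm256_and_si256`.

```cpp
#include <array>
#include <cstdint>

// Scalar model of a 256-bit register holding eight 32-bit lanes.
using Lanes8 = std::array<std::uint32_t, 8>;

// Lane-wise addition: every lane is added independently of its neighbours.
Lanes8 addLanes(const Lanes8& a, const Lanes8& b) {
    Lanes8 r{};
    for (int i = 0; i < 8; ++i) r[i] = a[i] + b[i];
    return r;
}

// Lane-wise bitwise AND: 256 bits are combined in one logical operation.
Lanes8 andLanes(const Lanes8& a, const Lanes8& b) {
    Lanes8 r{};
    for (int i = 0; i < 8; ++i) r[i] = a[i] & b[i];
    return r;
}
```

The bitwise case is the one that matters for coverage masks: if each bit is a pixel, one 256-bit AND combines 256 pixels at once.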
27
An abridged history of Intel’s SIMD instruction sets
[Timeline, 1998 to 2017]
• SSE, 1999: 128b wide
• SSE2, 2001
• SSE4, 2006: Intel® microarchitecture code name Nehalem
• AVX, 2011: 256b wide, 2nd Gen Intel® Core™ Processors
• AVX2, 2013: 4th Gen Intel® Core™ Processors
• AVX-512, 2016: 512b wide
[Callouts on the timeline: "New algorithm target architecture"; "Supported in our library code"; "Easily extended to AVX-512"]
28
Masked software occlusion culling
Algorithm Overview
• Triangle setup (8-wide triangle setup): Transform, Clip, Compute bounds, Depth plane, Traversal setup
• Tile traversal (8 scanlines; 256 pixels, i.e. 8 tiles with 8x4 pixels): Compute coverage, Depth test, Update
30
Transform and Clip
31
Compute Bounding Box
• Padded to 32x8 pixel supertiles
32
Compute Depth Plane
• Depth = ax + by + c
– Conservative tile depth: Check sign of a and b
– Can be incrementally updated
• Clamp to vertex depths
[Figure: tile grid with step arrows +a and +b; sign combinations of (a, b): (-, +), (+, +), (-, -), (+, -)]
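A small sketch of the depth plane idea, under stated assumptions: the struct and function names are invented for the example, vertices are given as screen-space (x, y, z), the triangle is assumed to be non-degenerate, and larger depth means farther away.

```cpp
#include <algorithm>

struct DepthPlane { float a, b, c; };  // depth(x, y) = a*x + b*y + c

// Fit the plane through three screen-space vertices (x, y, z).
// Assumes the triangle has non-zero area.
DepthPlane planeFromTriangle(const float v0[3], const float v1[3], const float v2[3]) {
    const float e1x = v1[0] - v0[0], e1y = v1[1] - v0[1], dz1 = v1[2] - v0[2];
    const float e2x = v2[0] - v0[0], e2y = v2[1] - v0[1], dz2 = v2[2] - v0[2];
    const float det = e1x * e2y - e1y * e2x;
    DepthPlane p;
    p.a = (dz1 * e2y - dz2 * e1y) / det;
    p.b = (dz2 * e1x - dz1 * e2x) / det;
    p.c = v0[2] - p.a * v0[0] - p.b * v0[1];
    return p;
}

// Conservative maximum depth over the tile [x, x + w] x [y, y + h]. The signs of
// a and b tell which tile corner has the largest depth. The result is clamped to
// the vertex depth range because the plane extends beyond the triangle itself.
float tileZMax(const DepthPlane& p, float x, float y, float w, float h,
               float triZMin, float triZMax) {
    const float cx = p.a > 0.0f ? x + w : x;
    const float cy = p.b > 0.0f ? y + h : y;
    const float z = p.a * cx + p.b * cy + p.c;
    return std::min(std::max(z, triZMin), triZMax);
}
```

Because the plane is linear, stepping one tile to the right changes the unclamped value by a * w and stepping one tile down changes it by b * h, which is what makes an incremental update possible.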
33
Supertile Traversal Order
[Figure: supertile traversal order]
34
AVX Register Layout
• One scanline per SIMD lane
36
Edge Slopes
• Compute slopes (∆y/∆x) once
– Similar to regular scanline rasterizers
37
Compute Intersections
• Compute intersections for each scanline
– Eight scanlines in parallel using AVX
[Figure: edge intersections per scanline]
38
Compute Coverage Mask
• Start with full coverage mask
– Shift each lane (scanline) to intersection
– AVX2 and later have per-lane shift instruction
• Repeat the same process for the next edge
– Edge is facing right → invert mask
• Combine masks of all overlapping edges
– Using bitwise AND
[Figure: per-scanline intersections for the left and right edges; each lane is shifted right (>>) to its intersection]
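The coverage steps above (full mask, per-scanline shift, invert for right-facing edges, AND) can be sketched in scalar C++, with one array entry standing in for each SIMD lane. The `Edge` struct, the pixel-center sampling rule and the choice of bit 31 as the leftmost pixel are assumptions made for this example, not details taken from the library.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// One triangle edge as seen by a 32-pixel wide, 8-scanline block.
// Hypothetical representation for this sketch.
struct Edge {
    float x0;              // x where the edge crosses the center of scanline 0
    float dxPerScanline;   // change in x from one scanline to the next
    bool interiorOnRight;  // true for a left edge of the triangle
};

// Full mask shifted to column n. Bit 31 is the leftmost pixel, so a right
// shift by n clears the n leftmost pixels.
std::uint32_t shiftedMask(float n) {
    if (n <= 0.0f) return 0xFFFFFFFFu;
    if (n >= 32.0f) return 0u;  // also avoids an undefined shift by 32
    return 0xFFFFFFFFu >> static_cast<int>(n);
}

std::array<std::uint32_t, 8> coverageMasks(const std::vector<Edge>& edges) {
    std::array<std::uint32_t, 8> masks{};
    for (int lane = 0; lane < 8; ++lane) {   // one SIMD lane per scanline
        std::uint32_t mask = 0xFFFFFFFFu;    // start with full coverage
        for (const Edge& e : edges) {
            const float x = e.x0 + e.dxPerScanline * lane;
            // First pixel whose center (n + 0.5) lies at or right of the edge.
            std::uint32_t m = shiftedMask(std::ceil(x - 0.5f));
            if (!e.interiorOnRight) m = ~m;  // edge faces right: invert mask
            mask &= m;                       // combine edges with bitwise AND
        }
        masks[lane] = mask;
    }
    return masks;
}
```

In a SIMD version the outer loop disappears: all eight scanlines are shifted by a single per-lane variable shift, which is the AVX2 instruction the slide refers to.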
45
Shuffle Mask
• Shuffle mask to form better shaped tiles
– Before: each SIMD lane is a scanline
– After: each SIMD lane is an 8x4 tile
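The before and after layouts can be illustrated with a scalar regrouping of a 32x8 pixel block. The bit layout here is an assumption chosen for readability (bit 0 is the leftmost pixel of a scanline) and the function name is hypothetical. Since every group of 8 horizontally adjacent pixels is one byte, the regrouping is a pure byte permutation, which is the kind of operation SIMD byte shuffle instructions are suited to.

```cpp
#include <array>
#include <cstdint>

// Input:  rows[y], where bit x is pixel (x, y) of a 32x8 block
//         (x in 0..31, y in 0..7). Each entry is one scanline.
// Output: tiles[t], one 8x4 tile per entry, with t = (y / 4) * 4 + (x / 8)
//         and bit (y % 4) * 8 + (x % 8) inside the tile.
std::array<std::uint32_t, 8> shuffleToTiles(const std::array<std::uint32_t, 8>& rows) {
    std::array<std::uint32_t, 8> tiles{};
    for (int y = 0; y < 8; ++y) {
        for (int tx = 0; tx < 4; ++tx) {
            // 8 adjacent pixels of scanline y that belong to tile column tx.
            const std::uint32_t byte = (rows[y] >> (8 * tx)) & 0xFFu;
            tiles[(y / 4) * 4 + tx] |= byte << (8 * (y % 4));
        }
    }
    return tiles;
}
```

After the shuffle, each entry covers a compact 8x4 region, so one conservative depth value per entry describes it much better than one value per 32-pixel scanline would.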
47
Depth Test
• Interpolate conservative depth (per 8x4 tile)
• Test against buffer
[Figure: buffer]
48
Update Tile
• Two code paths (can be switched at compile time)
– Original update method [AHAM15]
– New update method tailored for SW [HAAM16]
• Why use a new update method?
– Faster – same culling power
– Less accurate than original, more dependent on render order
– Works best if you render front-to-back
49
Update Tile, New Method [HAAM16]
• $z_{max}^{0}$ is the reference layer
– Maximum value for the entire tile
• $z_{max}^{1}$ is the working layer
– Maximum value for a subset of the tile
– Updated as
– New depth = $\max(z_{max}^{1}, z_{max}^{tri})$
– New mask = TriangleMask OR LayerMask
• Whenever working layer mask is full, overwrite reference layer
50
Update Tile
[Figure sequence: step-by-step tile update]
• Discard heuristic: If $z_{max}^{1} - z_{max}^{tri} > z_{max}^{0} - z_{max}^{1}$, discard working layer and restart
• Full overwrite: Restart from new value
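The update rules on these slides can be put together in a short sketch. It follows the stated rules (merge with max and OR, discard heuristic, overwrite when the working mask is full), but the `Tile` type, the cleared values, the 32-bit mask for an 8x4 tile and the early-out depth test are assumptions of this example rather than the library's actual code. Larger depth is taken to mean farther away.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tile type for this sketch. Larger depth means farther away.
struct Tile {
    float zMax0 = 1.0f;      // reference layer: bound for the entire tile (cleared to far)
    float zMax1 = 0.0f;      // working layer: bound for the pixels set in mask
    std::uint32_t mask = 0;  // pixels covered by the working layer (8x4 tile)
};

void updateTile(Tile& t, std::uint32_t triMask, float triZMax) {
    // Assumed depth test: a triangle at or behind the reference layer adds nothing.
    if (triMask == 0 || triZMax >= t.zMax0) return;

    // Discard heuristic: restart the working layer if the triangle is farther
    // from it than the working layer is from the reference layer.
    if (t.zMax1 - triZMax > t.zMax0 - t.zMax1) {
        t.zMax1 = 0.0f;
        t.mask = 0;
    }
    // Merge the triangle into the working layer.
    t.zMax1 = std::max(t.zMax1, triZMax);
    t.mask |= triMask;
    // Working layer covers the whole tile: it becomes the new reference layer.
    if (t.mask == 0xFFFFFFFFu) {
        t.zMax0 = t.zMax1;
        t.zMax1 = 0.0f;
        t.mask = 0;
    }
}

// Occlusion test against the reference layer only; zMin is the nearest depth
// of the tested geometry within this tile.
bool isOccluded(const Tile& t, float zMin) { return zMin > t.zMax0; }
```

Note that only the reference layer is ever read by the test, so a partially filled working layer contributes no culling until it is complete. That is one way to see why this method depends more on render order than the original.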
59
Update Tile
• Update is quicker than original [AHAM15]
• Test is also quicker
– Need only to test against reference layer ($z_{max}^{0}$)
60
Results
Intel Occlusion Culling Sample
[Chart: time (μs) per phase, Old [CMK16] vs. New [HAAM16]; speedup annotations: 16x and 3.7x]
• Clear: Clearing the depth buffer
• Geom: Transform & project geometry
• Rast: Triangle setup & occluder rasterization
• Gen: Compute HiZ buffer from full resolution z buffer
• Test: Perform occlusion queries
62
Results
Performance comparison for camera animation
[Chart: series "Frustum only", "Old" and "New", plotted from the first frame to the last frame]
63
Code is available as open-source
Masked Occlusion Culling API
Setup:
void SetResolution();
void SetNearClipPlane();
Render & query:
void ClearBuffer();
static void TransformVertices();
Result RenderTriangles();
Result TestTriangles();
Result TestRect();
Debug:
void ComputePixelDepthBuffer();
OcclusionCullingStatistics GetStatistics();
65
Masked Occlusion Culling API
Result RenderTriangles(
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices into the inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);
• Render to the software HiZ buffer
66
Masked Occlusion Culling API
Result RenderTriangles(
    float *inVtx,
    uint *inTris,
    int nTris,
    ClipPlanes mask,
    ScissorRect *scissor,
    VertexLayout &layout
);
[Figure: eye, near plane and view frustum. An object inside the frustum is labelled mask = 0; an object crossing the left and near planes is labelled mask = leftPlane | nearPlane]
• Clipping is not free...
– If you’re already doing frustum culling, let the API know the outcome
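One way to derive such a mask from an existing frustum test is to check which planes a bounding box straddles. The sketch below is self-contained and uses its own made-up flag values (`PlaneBits`); the library's actual `ClipPlanes` constants may differ, so treat the names and values as placeholders. It also assumes a D3D-style clip space where a point is inside when -w <= x <= w, -w <= y <= w and z >= 0.

```cpp
#include <array>

struct Vec4 { float x, y, z, w; };

// Hypothetical bit flags for this sketch; the library's ClipPlanes type may differ.
enum PlaneBits : unsigned {
    kPlaneNone = 0, kPlaneNear = 1, kPlaneLeft = 2, kPlaneRight = 4,
    kPlaneBottom = 8, kPlaneTop = 16
};

// Returns the planes that the 8 clip-space corners of a bounding box straddle,
// i.e. planes with at least one corner outside and at least one corner inside.
unsigned straddledPlanes(const std::array<Vec4, 8>& corners) {
    int outside[5] = {0, 0, 0, 0, 0};
    for (const Vec4& c : corners) {
        outside[0] += c.z < 0.0f;  // near
        outside[1] += c.x < -c.w;  // left
        outside[2] += c.x > c.w;   // right
        outside[3] += c.y < -c.w;  // bottom
        outside[4] += c.y > c.w;   // top
    }
    unsigned mask = kPlaneNone;
    for (int i = 0; i < 5; ++i)
        if (outside[i] > 0 && outside[i] < 8) mask |= 1u << i;
    return mask;
}
```

A box that lies entirely inside the frustum yields 0, so none of its triangles need clipping. A box with all corners outside one plane would already have been rejected by the frustum test itself.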
67
Masked Occlusion Culling API
Result RenderTriangles(
    float *inVtx,
    uint *inTris,
    int nTris,
    ClipPlanes mask,
    ScissorRect *scissor,
    VertexLayout &layout
);
[Figure: eye and view frustum with a scissor region (screen space AABB)]
• Can be used for threading
– One scissor region per thread
68
Masked Occlusion Culling API
Result TestTriangles(      // Returns the collective culling outcome of the triangles
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices into the inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);
• Test triangles against the software HiZ buffer
– Does not update the buffer
69
Masked Occlusion Culling API
Result TestRect(           // Returns the culling outcome of the screen space rectangle
    float xmin,            // Screen space bounds: [xmin, ymin] – [xmax, ymax]
    float ymin,
    float xmax,
    float ymax,
    float wmin             // Conservative clip space w (typically the w-component of the nearest bbox vertex in clip space)
);
• Test rectangle against the software HiZ buffer
– Does not update the buffer
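The inputs for such a rectangle test can be derived from the clip-space corners of a bounding box, roughly as sketched below. The `RectQuery` struct and the function name are invented for the example, and it assumes the bounds are wanted after the perspective divide; the coordinate convention the library expects should be checked in its header. Boxes that touch or cross the camera plane (w <= 0) are flagged as invalid here, and a caller would typically treat them as visible.

```cpp
#include <algorithm>
#include <array>

struct Vec4 { float x, y, z, w; };

// Hypothetical result type for this sketch.
struct RectQuery {
    bool valid;                    // false if a corner has w <= 0
    float xmin, ymin, xmax, ymax;  // bounds after the perspective divide
    float wmin;                    // smallest clip-space w (the nearest corner)
};

RectQuery rectFromClipSpaceBox(const std::array<Vec4, 8>& corners) {
    RectQuery q = {true, 1e30f, 1e30f, -1e30f, -1e30f, 1e30f};
    for (const Vec4& c : corners) {
        if (c.w <= 0.0f) {  // projection is undefined at or behind the camera plane
            q.valid = false;
            return q;
        }
        const float x = c.x / c.w;
        const float y = c.y / c.w;
        q.xmin = std::min(q.xmin, x);
        q.ymin = std::min(q.ymin, y);
        q.xmax = std::max(q.xmax, x);
        q.ymax = std::max(q.ymax, y);
        q.wmin = std::min(q.wmin, c.w);
    }
    return q;
}
```

Using the smallest w for the whole rectangle keeps the query conservative: the rectangle is tested as if every point of it were as close as the nearest corner.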
70
Example use case: Scene Bounding Volume Hierarchy (BVH) traversal and culling
RenderFrame:
ClearBuffer();
prioQueue.push(root);
while (!prioQueue.empty()) {
    Node node = prioQueue.pop();
    if (FrustumTest(node) == Culled)
        continue;
    bounds = compute_screen_space_bounds(node);
    if (TestRect(bounds) == Culled)
        continue;                     // Culled!
    if (node is InnerNode) {
        prioQueue.push(node.left, dist);
        prioQueue.push(node.right, dist);
    } else {                          // node is Leaf
        xfVertices = TransformVertices(node.vertices);
        RenderTriangles(xfVertices);
        send_leaf_to_GPU();
    }
}
71
Essential Tools We Have Relied On
• Intel® VTune™
– https://software.intel.com/en-us/intel-vtune-amplifier-xe
• SSE/AVX intrinsics guide
– https://software.intel.com/sites/landingpage/IntrinsicsGuide/
72
References
[AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9
[CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016)
[Col11] COLLIN D.: Culling the Battlefield. Game Developer’s Conference (presentation), (2011)
[Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238
[HA15] HAAR U., AALTONEN S.: GPU-Driven Rendering Pipelines. SIGGRAPH Advances in Real-Time Rendering in Games course, (2015)
[HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016)
73
Check it out!
• GitHub: Lightweight library
– https://github.com/GameTechDev/MaskedOcclusionCulling
• GitHub: Example integrated in Intel’s Software Occlusion Culling demo
– https://github.com/GameTechDev/OcclusionCulling
• Project page: Masked Software Occlusion Culling
– https://software.intel.com/en-us/articles/masked-software-occlusion-culling
• Questions and feedback welcome
– magnus.andersson@intel.com
74
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other
sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For
more complete information visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide
cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and
uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain
design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data
are accurate.
© 2016 Intel Corporation. Intel, the Intel logo, VTune and others are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.