Tessellation in a Low Poly World

Nicolas Thibieroz
AMD Graphics Products Group
nicolas.thibieroz@amd.com
GDC Paris 2008
09/04/2015
Original materials from
Bill Bilodeau
What is Tessellation?
Tessellation is the process of adding new primitives into
an existing model
Triangle counts can be “dialed in” by adjusting the
tessellation level
Low
Medium
High
AMD Hardware Tessellator
[Diagram: graphics pipeline stages Input Assembler, Tessellator, Vertex Shader, Rasterizer, Pixel Shader and Output Merger, connected to Memory / Resources]
Hardware tessellation allows you to
render more polygons for better
silhouettes
Initial concept artwork from
Bay Raitt, Valve
Surface control cages are easier to
work with than individual triangles
Artists prefer to create models this way
Animations are simpler on a control cage
Control cage can be animated on the GPU, then
tessellated in a second pass
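[Diagram: a first Vertex Shader / Pixel Shader pass animates the control cage and writes it out through R2VB (render to vertex buffer); the Animated Control Cage then goes through the Tessellator, Vertex Shader and Pixel Shader in the second pass]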
Hardware tessellation is a form of
compression
Smaller footprint – you only need to store the control
cage and possibly a displacement map
Improved bandwidth – less data to transfer from
memory to GPU
Three types of primitives, or
“superprims”, are supported
Triangles
Quads
Lines
There are two tessellation modes
- Continuous
- Adaptive
Continuous Tessellation
Specify floating point tessellation level per-draw call
– Tessellation levels range from 1.0 to 14.99
Eliminates popping as vertices are added through tessellation
Level 1.0
Level 2.0
[Figure: the mesh at Level 1.0, at intermediate levels 1.1, 1.3 and 1.7, and at Level 2.0]
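One common way to "dial in" a continuous level is to derive it from camera distance. The sketch below is only an illustration under stated assumptions: `ContinuousTessLevel`, `nearDist` and `farDist` are hypothetical names that do not come from the AMD tessellation library, and the linear falloff is an arbitrary choice. Only the 1.0 to 14.99 range is taken from the slide.

```cpp
#include <algorithm>

// Range quoted on the slide for continuous tessellation.
constexpr float kMinTessLevel = 1.0f;
constexpr float kMaxTessLevel = 14.99f;

// Hypothetical helper (not part of the tessellation library):
// returns the highest level at nearDist or closer, the lowest level at
// farDist or further, and a linear blend in between.
// Assumes farDist > nearDist.
float ContinuousTessLevel(float distance, float nearDist, float farDist)
{
    // Normalised distance, clamped to [0, 1].
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::max(0.0f, std::min(1.0f, t));

    // Blend between the two limits; fractional results are fine because
    // the level is a floating point value.
    return kMaxTessLevel * (1.0f - t) + kMinTessLevel * t;
}
```

Because the level changes smoothly with distance, an object approaching the camera gains vertices gradually rather than switching between discrete LODs.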
Adaptive allows different levels of
tessellation within the same mesh
Edge tessellation factor = 7.x
Edge tessellation factor = 5.x
Edge tessellation factor = 5.x
Adaptive tessellation can be done in
real-time using multiple passes
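[Diagram: multi-pass adaptive tessellation. The Superprim Mesh goes through a Vertex Shader / Pixel Shader pass and is written out through R2VB as the Transformed Superprim Mesh; a further Vertex Shader / Pixel Shader pass produces the Tessellation Factors; the final pass feeds the Superprim Mesh (Stream 0) and the Tessellation Factors (Stream 1) to the Tessellator, followed by the Vertex Shader (with a Sampler) and the Pixel Shader]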
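The slides do not say how the per-edge tessellation factors are computed. As a rough sketch only, one plausible heuristic is to scale each factor by the edge length divided by the distance from the camera to the edge midpoint. Everything here (`Vec3`, `EdgeTessFactor`, the `scale` parameter) is a hypothetical name introduced for illustration, and the computation is shown on the CPU rather than in the shader passes the slide describes.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Euclidean distance between two points.
static float Distance(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical heuristic (an assumption, not the library's method):
// longer edges and edges closer to the eye get a higher factor.
// The result depends only on the edge itself, so two triangles sharing
// an edge compute the same factor for it.
float EdgeTessFactor(const Vec3& v0, const Vec3& v1, const Vec3& eye,
                     float scale, float minLevel, float maxLevel)
{
    const Vec3 mid = { (v0.x + v1.x) * 0.5f,
                       (v0.y + v1.y) * 0.5f,
                       (v0.z + v1.z) * 0.5f };
    const float edgeLength = Distance(v0, v1);
    const float eyeDistance = std::max(Distance(mid, eye), 1e-6f);
    const float factor = scale * edgeLength / eyeDistance;
    return std::max(minLevel, std::min(maxLevel, factor));
}
```

Keeping the factor a function of the shared edge alone is the design point: neighbouring superprims then agree on how that edge is subdivided.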
Code Example: Continuous Tessellation
// Enable tessellation:
TSSetTessellationMode( pd3dDevice, TSMD_ENABLE_CONTINUOUS );

// Set tessellation level:
TSSetMaxTessellationLevel( pd3dDevice, sg_fMaxTessellationLevel );

// Select appropriate technique to render our tessellated objects:
sg_pEffect->SetTechnique( "RenderTessellatedDisplacedScene" );

// Render all passes with tessellation
V( sg_pEffect->Begin( &cPasses, 0 ) );
for ( iPass = 0; iPass < cPasses; iPass++ )
{
    V( sg_pEffect->BeginPass( iPass ) );
    V( TSDrawMeshSubset( sg_pMesh, 0 ) );
    V( sg_pEffect->EndPass() );
}
V( sg_pEffect->End() );

// Disable tessellation:
TSSetTessellationMode( pd3dDevice, TSMD_DISABLE );
The vertex shader is used as an
evaluation shader
[Diagram: the Super-prim Mesh enters the Tessellator, which outputs the Tessellated Mesh; the Vertex Shader (Evaluation Shader) samples the Displacement Map through a Sampler and outputs the Tessellated and Displaced Mesh]
Example Code: Evaluation Vertex Shader
struct VsInputTessellated
{
    // Barycentric weights for this vertex
    float3 vBarycentric   : BLENDWEIGHT0;

    // Data from superprim vertex 0:
    float4 vPositionVert0 : POSITION0;
    float2 vTexCoordVert0 : TEXCOORD0;
    float3 vNormalVert0   : NORMAL0;

    // Data from superprim vertex 1:
    float4 vPositionVert1 : POSITION4;
    float2 vTexCoordVert1 : TEXCOORD4;
    float3 vNormalVert1   : NORMAL4;

    // Data from superprim vertex 2:
    float4 vPositionVert2 : POSITION8;
    float2 vTexCoordVert2 : TEXCOORD8;
    float3 vNormalVert2   : NORMAL8;
};
Example Code: Evaluation Vertex Shader
VsOutputTessellated VSRenderTessellatedDisplaced( VsInputTessellated i )
{
    VsOutputTessellated o;

    // Compute new position based on the barycentric coordinates:
    float3 vPosTessOS = i.vPositionVert0.xyz * i.vBarycentric.x
                      + i.vPositionVert1.xyz * i.vBarycentric.y
                      + i.vPositionVert2.xyz * i.vBarycentric.z;

    // Output world-space position:
    o.vPositionWS = vPosTessOS;

    // Compute new normal vector for the tessellated vertex:
    o.vNormalWS = i.vNormalVert0.xyz * i.vBarycentric.x
                + i.vNormalVert1.xyz * i.vBarycentric.y
                + i.vNormalVert2.xyz * i.vBarycentric.z;

    // Compute new texture coordinates based on the barycentric coordinates:
    o.vTexCoord = i.vTexCoordVert0.xy * i.vBarycentric.x
                + i.vTexCoordVert1.xy * i.vBarycentric.y
                + i.vTexCoordVert2.xy * i.vBarycentric.z;

    // Displace the tessellated vertex (sample the displacement map)
    o.vPositionWS = DisplaceVertex( vPosTessOS, o.vTexCoord, o.vNormalWS );

    // Transform position to screen-space:
    o.vPosCS = mul( float4( o.vPositionWS, 1.0 ), g_mWorldViewProjection );

    return o;
}   // End of VsOutputTessellated VSRenderTessellatedDisplaced(..)
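The evaluation shader applies the same weighted sum to positions, normals and texture coordinates. As a small CPU-side illustration of that one step, the C++ sketch below interpolates a three-component attribute from the three superprim vertices. `Float3` and `InterpolateBarycentric` are hypothetical names used only for this example.

```cpp
struct Float3 { float x, y, z; };

// Weighted sum of the three superprim vertex attributes.
// w.x, w.y and w.z are the barycentric weights for vertices 0, 1 and 2,
// mirroring vBarycentric in the shader input structure.
Float3 InterpolateBarycentric(const Float3& a0, const Float3& a1,
                              const Float3& a2, const Float3& w)
{
    return { a0.x * w.x + a1.x * w.y + a2.x * w.z,
             a0.y * w.x + a1.y * w.y + a2.y * w.z,
             a0.z * w.x + a1.z * w.y + a2.z * w.z };
}
```

A weight of (1, 0, 0) returns the first vertex unchanged, and weights that sum to one always give a point inside the superprim triangle.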
What if you want to do more?
DirectX 9 has a limit of 15 float4 vertex input
components – High order surfaces need more inputs
TSToggleIndicesRetrieval() allows you to fetch the
super-prim data from a vertex texture
[Diagram: the Tessellator passes (u,v) to the Vertex Shader, which fetches the Bezier Control Points P0,0, P0,1 … P3,3 through a Sampler]
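With the sixteen control points available, evaluating a bicubic Bezier patch at (u,v) is a standard tensor-product sum of cubic Bernstein polynomials. The sketch below shows that evaluation on the CPU in C++. It is not the sample's shader code, and `Float3`, `ControlGrid`, `Bernstein3` and `EvalBezierPatch` are hypothetical names. It also assumes the first index of the grid goes with u and the second with v.

```cpp
#include <array>

struct Float3 { float x, y, z; };

// 4x4 grid of control points P[i][j], i along u and j along v (assumed).
using ControlGrid = std::array<std::array<Float3, 4>, 4>;

// Cubic Bernstein basis functions evaluated at t.
std::array<float, 4> Bernstein3(float t)
{
    const float s = 1.0f - t;
    std::array<float, 4> b = {{ s * s * s,
                                3.0f * s * s * t,
                                3.0f * s * t * t,
                                t * t * t }};
    return b;
}

// Point on the patch: sum over i, j of B_i(u) * B_j(v) * P[i][j].
Float3 EvalBezierPatch(const ControlGrid& P, float u, float v)
{
    const std::array<float, 4> bu = Bernstein3(u);
    const std::array<float, 4> bv = Bernstein3(v);
    Float3 r = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            const float w = bu[i] * bv[j];
            r.x += P[i][j].x * w;
            r.y += P[i][j].y * w;
            r.z += P[i][j].z * w;
        }
    }
    return r;
}
```

Sixteen positions already exceed what fits comfortably in the vertex input registers, which is why the slide fetches them from a vertex texture instead.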
Other Tessellation Library Functions
TSDrawIndexed(…)
– Analogous to DrawIndexedPrimitive(…)
TSDrawNonIndexed(…)
– Needed for adaptive tessellation, since every edge needs its
own tessellation level
TSSetMinTessellationLevel(…)
– Sets the minimum tessellation level for adaptive
tessellation
TSComputeNumTessellatedPrimitives(…)
– Calculates the number of tessellated primitives that will be
generated by the tessellator
Displacement mapping alters
tangent space
To do normal mapping we need to rotate tangent space
Alternatively, use model space normal maps
– Doesn’t work with animation or tiling
Displacement map lighting
Use the displacement map to calculate the per-pixel
normal
– Central differencing with neighboring displacements can
approximate the derivative
Light with the computed normal
No need to use a normal map
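The central differencing step can be sketched on the CPU as follows. This is a minimal illustration, not the sample's pixel shader: `HeightMap`, `NormalFromDisplacement`, `heightScale` and `texelSize` are hypothetical names, borders use clamped addressing, and the normal is expressed in the space of the undisplaced surface with z along the base normal.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Float3 { float x, y, z; };

// Row-major displacement map with clamp-to-edge addressing (assumed).
struct HeightMap {
    int width;
    int height;
    std::vector<float> texels;

    float At(int x, int y) const
    {
        x = std::max(0, std::min(width - 1, x));
        y = std::max(0, std::min(height - 1, y));
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Approximates the derivative of the displacement with central
// differences of the neighbouring texels, then builds a unit normal
// for the surface z = h(x, y).
Float3 NormalFromDisplacement(const HeightMap& h, int x, int y,
                              float heightScale, float texelSize)
{
    const float dhdx = (h.At(x + 1, y) - h.At(x - 1, y)) * heightScale
                     / (2.0f * texelSize);
    const float dhdy = (h.At(x, y + 1) - h.At(x, y - 1)) * heightScale
                     / (2.0f * texelSize);
    const float len = std::sqrt(dhdx * dhdx + dhdy * dhdy + 1.0f);
    return { -dhdx / len, -dhdy / len, 1.0f / len };
}
```

A flat map yields (0, 0, 1), and a steeper slope tilts the normal further away from the base normal.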
Terrain Rendering: Performance Results
                                        Low Resolution with     High Resolution,
                                        Tessellation            No Tessellation
On-disk model polygon count
(pre-tessellation)                      840 triangles           1,280,038 triangles
Original model rendering cost           1210 fps (0.83 ms)
Actual rendered model polygon count     1,008,038 triangles     1,280,038 triangles
VRAM vertex buffer size                 70 KB                   31 MB
VRAM index buffer size                  23 KB                   14 MB
Rendering time                          821.41 fps (1.22 ms)    301 fps (3.32 ms)

Both use the same displacement map (2K x 2K) and identical pixel shaders.
Rendering with tessellation is > 6X faster* and provides memory savings of over 44 MB!
* Subtracting the cost of shading
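The "> 6X" and "over 44MB" figures can be checked against the numbers on the slide. The sketch below assumes that the 0.83 ms "original model rendering cost" is the shading cost being subtracted from both rendering times, and that 1 MB is 1024 KB. Both assumptions are a reading of the slide rather than something it states. The function names are made up for this example.

```cpp
// Frame times from the slide, in milliseconds.
constexpr double kShadingMs     = 0.83;  // original low-res model (assumed shading cost)
constexpr double kTessellatedMs = 1.22;  // low resolution with tessellation
constexpr double kHighResMs     = 3.32;  // high resolution, no tessellation

// Ratio of the raw frame times, roughly 2.7.
double RawSpeedup()
{
    return kHighResMs / kTessellatedMs;
}

// Ratio after subtracting the assumed shading cost from both times:
// 2.49 ms / 0.39 ms, roughly 6.4.
double SpeedupExcludingShading()
{
    return (kHighResMs - kShadingMs) / (kTessellatedMs - kShadingMs);
}

// Vertex plus index buffer savings: (31 MB + 14 MB) - (70 KB + 23 KB),
// a little under 45 MB.
double MemorySavedMB()
{
    return (31.0 + 14.0) - (70.0 + 23.0) / 1024.0;
}
```

Under these assumptions the raw frame times differ by a factor of about 2.7, and the factor of more than 6 only appears once the shading cost is removed from both.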
Terrain Tessellation Sample
AMD GPU MeshMapper
New tool for generating normal, displacement,
and ambient occlusion maps from hi-res and
low-res mesh pairs
Advantages of the Tessellator
• Saves memory bandwidth and reduces memory
footprint
• Flexible support for displacement mapping and many
kinds of high order surfaces
• Easier content creation – artists and animators only
need to work with low resolution geometry
• Continuous LOD avoids unnecessary triangles
• The tessellator is available now on the Xbox 360 and
the latest ATI Radeon and FireGL graphics cards
• Public availability of tessellation SDK very soon
Harnessing the Power of Multiple GPUs
Nicolas Thibieroz
AMD Graphics Products Group
nicolas.thibieroz@amd.com
GDC Paris 2008
Original materials from
Jon Story & Holger Grün
Why MGPU?
MGPUs can be used to dramatically increase performance
and visual quality
– At higher screen resolutions
– Especially with increased use of MSAA
Many applications become GPU limited at higher screen
resolutions
– High resolution monitors => mainstream affordability
Achieve next generation performance on today’s HW
– Prototype your next engine
Provides an upgrade path for mainstream parts
Multiple Boards
An increasing number of
motherboards can accept 2 or
more discrete video cards
Connected by high speed
crossover cables
2x
Now possible to fit 4 Radeon
HD3850 boards to a single
motherboard
CrossFireX technology allows you
to harness that performance
4x
Multiple GPUs per Board
The Radeon HD3870 X2 is a
single-board multi-GPU
architecture
– AFR is on by default
Heavy peer to peer communication
– Bi-directional 16x lane pipe connecting the 2 GPUs
2x
CrossFireX supports 2 HD3870 X2
boards for Quad GPU performance
4x
Hybrid Crossfire
Combination of integrated
and discrete graphics
3D graphics performance
boost
– Laptops
– Mainstream desktop PCs
Use less power during non-taxing graphical tasks
CrossFire Rendering Modes
Split Frame Rendering / Scissor
– Screen is divided into as many regions as there are GPUs
– Dynamic load balancing
Alternate Frame Rendering
– GPUs take alternate frames
– Vertex processing not duplicated
– Highest performing mode
How does AFR Work?
[Diagram: the CPU submits a stream of commands; the commands for Frame N are queued to GPU0 and the commands for Frame N+1 are queued to GPU1]
Hardware Considerations
Current MGPU setups are not shared memory
architectures
– Resources placed in local video memory are duplicated for
each GPU
Driver initiates peer to peer (P2P) copies to keep
resources in sync
– On some chipsets this may involve the CPU
– Synchronizes all GPUs
– Very heavy impact on performance that can even result in
negative scaling
Driver Modes
Compatible AFR Mode
– Default mode
– Driver checks for AFR unfriendly behaviour
– Will P2P copy stale resources
Full AFR Mode (Application Profile)
– Driver recognises EXE name
– Use a unique name and don’t change it
– Behaviour fully guided by profile
– Best performance – no checking
– Rename EXE to “AFR-FriendlyD3D.exe”
– Use “AFR-FriendlyOGL.exe” for OpenGL
– No checking: speed & compatibility test
Detecting the Number of GPUs
Visit http://ati.amd.com/developer
– Download project called “CrossFire Detect”
Statically link to:
– “atimgpud_s_x86.lib” 32 bit version
– “atimgpud_s_x64.lib” 64 bit version
Include header file:
– “atimgpud.h”
Call this function:
– INT count = AtiMultiGPUAdapters();
Common Pitfalls & Solutions
Pitfall: Dependencies Between Frames
[Diagram: each GPU holds its own copy of resource A.
GPU0 (Frame N): Draw using A, Update resource A, Present (N)
GPU1 (Frame N+1): P2P copy from GPU0 to GPU1, Draw using A, Update resource A, Present (N+1)]
Solution: Resources that Change Every
Frame
[Diagram: each GPU holds its own copy of resource A.
GPU0 (Frame N): Update resource A, Draw using A, Present (N)
GPU1 (Frame N+1): Update resource A, Draw using A, Present (N+1)]
There are no P2P copies if one always modifies the resource before using it within
a frame!
Solution: Resources that Change Every
Few Frames
[Diagram: each GPU holds its own copy of resource A.
Frame N (GPU0): Update resource A, Draw using A, Present (N)
Frame N+1 (GPU1): Update resource A, Draw using A, Present (N+1)
Frames N+2 to N+4: Draw using A, Present]
Repeat the modification for N GPU frames (one frame per GPU) to ensure that each GPU has the same data! No
P2P copies will happen!
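One way to implement "repeat the modification for N GPU frames" is a small countdown per resource. The sketch below is illustrative only: `AfrResourceUpdater` is a hypothetical class, and the GPU count is assumed to come from a call such as `AtiMultiGPUAdapters()`. The application would re-issue its own update code whenever `ShouldUpdateThisFrame()` returns true.

```cpp
// Hypothetical helper: remembers for how many more frames an update
// must be re-issued so that every GPU's copy of the resource receives it.
class AfrResourceUpdater {
public:
    explicit AfrResourceUpdater(int gpuCount)
        : gpuCount_(gpuCount < 1 ? 1 : gpuCount), framesLeft_(0) {}

    // Call when the application changes the resource contents.
    void MarkDirty() { framesLeft_ = gpuCount_; }

    // Call once per frame, before the first draw that uses the resource.
    // Returns true while the update still has to be repeated.
    bool ShouldUpdateThisFrame()
    {
        if (framesLeft_ > 0) {
            --framesLeft_;
            return true;
        }
        return false;
    }

private:
    int gpuCount_;    // number of GPUs rendering alternate frames
    int framesLeft_;  // frames that still need the update re-applied
};
```

With two GPUs, a change marked in frame N is applied in frames N and N+1, and later frames draw without touching the resource.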
Pitfalls: In DX10 there are Other Ways
to Update Resources...
Drawing to vertex/index buffers
Stream Out
CopyResource() calls
CopySubresourceRegion() calls
GenerateMips() calls
ResolveSubresource() calls
Pitfall: Waiting on Queries
[Diagram: the CPU stalls waiting for a query result, so it stops submitting commands to GPU0 (Frame N) and GPU1 (Frame N+1)]
Waiting starves GPU queues
Waiting limits parallelism
Waiting => CPU limitation
Solution: Queries
Avoid using queries whenever possible
– For occlusion queries consider a CPU-based
approach
Avoid waiting on query results
– Pick up the result of a query at least N-GPU
frames after it was issued
For queries issued every frame
– Create additional query objects for each GPU
– Cycle through them
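Cycling through one query object per GPU can be sketched as a ring buffer that is read just before a slot is reused. The code below is a simplified model, not Direct3D code: `QueryRing` is a hypothetical class, and a plain `int` stands in for the result that a real query object would return.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ring of query slots, one per GPU.
class QueryRing {
public:
    explicit QueryRing(std::size_t gpuCount)
        : slots_(gpuCount ? gpuCount : 1), frame_(0) {}

    // Call once per frame. Reuses the oldest slot for this frame's query.
    // If that slot held a query issued slots_.size() frames ago, its value
    // is copied to oldResult and the function returns true.
    bool BeginFrame(int newQueryValue, int& oldResult)
    {
        Slot& slot = slots_[frame_ % slots_.size()];
        const bool hadResult = slot.inFlight;
        if (hadResult) {
            oldResult = slot.value;  // old enough to read without a stall
        }
        slot.value = newQueryValue;  // stands in for issuing a new query
        slot.inFlight = true;
        ++frame_;
        return hadResult;
    }

private:
    struct Slot {
        bool inFlight = false;
        int value = 0;
    };
    std::vector<Slot> slots_;
    std::size_t frame_;
};
```

The results lag by as many frames as there are GPUs, which is the price of never blocking the CPU on a query that another GPU has not finished yet.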
Pitfall: CPU Access to a Renderable
Resource
When the CPU locks a renderable resource it must wait
for all GPUs to finish using the resource before acquiring
the pointer
All GPUs now have to wait until the CPU unlocks the
resource pointer
After the unlock the driver has to update the resource on
each GPU via P2P copies
Just don’t do this – it destroys performance even on a
single GPU setup, and is catastrophic for MGPUs
Solutions: Locks / Maps
In DX10 stream to and copy from STAGING textures
In DX9 StretchRect() is always better than Lock()
At resource creation time use the appropriate flags from:
– D3D10_USAGE
– D3D10_CPU_ACCESS_FLAG
In DX9 never lock static Vertex/Index Buffers because it
will cause P2P copies
Concluding Pitfalls & Solutions
Drivers take a conservative approach
– Perform checks on resource synchronization
– P2P copy if necessary
You know the application best
– Determine if a P2P copy is necessary
– Talk to us about a profile
AFR-Friendly SDK Sample
Part of the ATI developer SDK
– http://ati.amd.com/developer
Detects the number of GPUs
Correctly deals with textures used as render targets
Provides a solution for dealing with mouse cursor lag
Go and take a look!!
Call to Action
• MGPUs provide demonstrable performance gains
• MGPUs boost visual quality
• Plan from day one to make your rendering scale
• Detect the number of GPUs
• Regularly check for AFR unfriendly behavior
• Talk to us...
QUESTIONS?
nicolas.thibieroz@amd.com