Cg and Hardware Accelerated Shading Cem Cebenoyan

Overview
Cg Overview
Where we are in hardware today
Physical Simulation on GPU
GeForce FX / Cg Demos
Advanced hair and skin rendering in “Dawn”
Adaptive subdivision surfaces and ambient occlusion shading in “Ogre”
Procedural shading in “Time Machine”
Depth of field and post-processing effects in “Toys”
Order-independent transparency (OIT)
NVIDIA CONFIDENTIAL
What is Cg?
A high level language for controlling parts of the
graphics pipeline of modern GPUs
Today, this includes the vertex transformation and
fragment processing units of the pipeline
Very C-like
Only simpler
Native support for vectors, matrices, dot-products,
reflection vectors, etc.
Similar in scope to RenderMan
But notably different, to suit the way hardware
accelerators work
Cg Pipeline Overview
Graphics program written in Cg (“C” for Graphics) → compiled and optimized → low-level graphics “assembly code”
Graphics Data Flow
Application → Vertex Program (Cg program) → Fragment Program (Cg program) → Framebuffer
Example fragment program excerpt:
//
// Diffuse lighting
//
float d = dot(normalize(frag.N), normalize(frag.L));
if (d < 0)
    d = 0;
c = d * f4tex2D(t, frag.uv) * diffuse;
…
Graphics Hardware Today
Fully programmable vertex processing
Full IEEE 32-bit floating point processing
Native support for mul, dp3, dp4, rsq, pow, sin, cos...
Full support for branching, looping, subroutines
Fully programmable pixel processing
IEEE 32-bit, 16-bit (s10e5) math supported
Same native math ops as vertex, plus texture fetch,
and derivative instructions
No branching, but >1000 instruction limit
Floating point textures / frame buffers
No blending / filtering yet
~500 MHz core clock
Physical Simulation
Simple cellular automata-like simulations are
possible on NV20 class hardware (e.g. Game of
Life, Greg James’ water simulation, Mark Harris’
CML work)
Use textures to represent physical quantities (e.g.
displacement, velocity, force) on a regular grid
Multiple texture lookups allow access to
neighbouring values
Pixel shader calculates new values, renders results
back to texture
Each rendering pass draws a single quad,
calculating next time step in simulation
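The grid-update pattern described above can be sketched on the CPU. This is only an illustrative reference in Python, not the shader itself: the function name `life_step`, the wrap-around edge handling, and the choice of the Game of Life as the rule are assumptions made for the example. Each call plays the role of one rendering pass that reads the source "texture" and writes a fresh one.

```python
def life_step(grid):
    """One 'pass': read every cell's neighbours from grid, write a new grid."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Eight neighbour lookups, analogous to multiple texture fetches.
            # Edges wrap around (an assumption made for this sketch).
            n = sum(grid[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dx, dy) != (0, 0))
            # Game of Life rule: birth on 3 neighbours, survival on 2 or 3.
            out[y][x] = 1 if n == 3 or (grid[y][x] == 1 and n == 2) else 0
    return out


# A vertical "blinker" oscillates between vertical and horizontal.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0]]
next_grid = life_step(blinker)
```

Because the output is written to a separate grid, every cell sees the same input state, which mirrors reading one texture while rendering into another.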
Physical Simulation
Problem: 8 bit precision on NV20 is not enough,
causes drifting, stability problems
Float precision on NV30 allows GPU physics to
match CPU accuracy
New fragment programming model (longer
programs, flexible dependent texture reads) allows
much more interesting simulations
Example: Cloth Simulation Shader
Uses Verlet integration (see: Jakobsen, GDC 2001)
Avoids storing explicit velocity
newx = x + (x - oldx)*damping + a*dt*dt
Not always accurate, but stable!
Store current and previous position of each
particle in 2 RGB float textures
Fragment program calculates new position, writes
result to float buffer
Copy float buffer back to texture for next iteration
(could use render-to-texture instead)
Swap current and previous textures
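A minimal CPU sketch of the Verlet update and the texture swap, in Python. The function name `verlet_step` and the tuple-based particle representation are assumptions made for illustration; the update itself is the formula from the slide.

```python
def verlet_step(x, oldx, a, dt, damping):
    """Return (new current position, new previous position).

    Velocity is never stored; it is implied by (x - oldx).
    """
    newx = tuple(xi + (xi - oi) * damping + ai * dt * dt
                 for xi, oi, ai in zip(x, oldx, a))
    # Returning (newx, x) is the "swap": the old current position
    # becomes the previous position for the next iteration.
    return newx, x


# A particle at rest, falling under gravity for two steps.
x, oldx = (0.0, 1.0, 0.0), (0.0, 1.0, 0.0)
gravity = (0.0, -10.0, 0.0)
x, oldx = verlet_step(x, oldx, gravity, 0.1, 1.0)
first_y = x[1]
x, oldx = verlet_step(x, oldx, gravity, 0.1, 1.0)
second_y = x[1]
```

With a damping factor below 1 the implied velocity decays each step, which is part of why the scheme stays stable even when it is not accurate.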
Cloth Shader Demo
Cloth Simulation Shader
2 passes:
1. Perform integration
2. Apply constraints:
Floor constraint
Sphere constraint
Distance constraints between particles
Read back float frame buffer using glReadPixels
Draw particles and constraints
Cloth Simulation Cg Code (1st pass)
void Integrate(inout float3 x, float3 oldx, float3 a, float timestep2, float damping)
{
    x = x + damping*(x - oldx) + a*timestep2;
}
myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float timestep,
               uniform float damping,
               uniform float3 gravity)
{
    myFragout Out;
    float2 s = In.TEX0.xy;
    // get current and previous position
    float3 x = f3texRECT(x_tex, s);
    float3 oldx = f3texRECT(ox_tex, s);
    // move the particle
    Integrate(x, oldx, gravity, timestep*timestep, damping);
    Out.COL.xyz = x;
    return Out;
}
Cloth Simulation Cg Code (2nd pass)
// constrain particle to be fixed distance from another particle
void DistanceConstraint(float3 x, inout float3 newx, float3 x2,
                        float restlength, float stiffness)
{
    float3 delta = x2 - x;
    float deltalength = length(delta);
    float diff = (deltalength - restlength) / deltalength;
    newx = newx + delta*stiffness*diff;
}
// constrain particle to be outside sphere
void SphereConstraint(inout float3 x, float3 center, float r)
{
    float3 delta = x - center;
    float dist = length(delta);
    if (dist < r) {
        x = center + delta*(r / dist);
    }
}
// constrain particle to be above floor
void FloorConstraint(inout float3 x, float level)
{
    if (x.y < level) {
        x.y = level;
    }
}
Cloth Simulation Cg Code (cont.)
myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float dist,
               uniform float stiffness)
{
    myFragout Out;
    float2 s = In.TEX0.xy;
    // get current position
    float3 x = f3texRECT(x_tex, s);
    // satisfy constraints
    FloorConstraint(x, 0.0f);
    SphereConstraint(x, float3(0.0, 2.0, 0.0), 1.0f);
    // get positions of neighbouring particles
    float3 x1 = f3texRECT(x_tex, s + float2(1.0, 0.0) );
    float3 x2 = f3texRECT(x_tex, s + float2(-1.0, 0.0) );
    float3 x3 = f3texRECT(x_tex, s + float2(0.0, 1.0) );
    float3 x4 = f3texRECT(x_tex, s + float2(0.0, -1.0) );
    // apply distance constraints, skipping neighbours outside the grid
    // (the hard-coded bounds suggest a 32x32 particle grid)
    float3 newx = x;
    if (s.x < 31) DistanceConstraint(x, newx, x1, dist, stiffness);
    if (s.x > 0) DistanceConstraint(x, newx, x2, dist, stiffness);
    if (s.y < 31) DistanceConstraint(x, newx, x3, dist, stiffness);
    if (s.y > 0) DistanceConstraint(x, newx, x4, dist, stiffness);
    Out.COL.xyz = newx;
    return Out;
}
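For checking results against the CPU, the distance-constraint part of the second pass can be mirrored in a few lines of Python. This is a sketch under assumptions: the grid is a nested list of `[x, y, z]` positions, and the names `distance_constraint` and `constraint_pass` are made up for the example. It keeps the same "gather" structure as the fragment program: each particle reads its neighbours and writes only its own new position.

```python
import math


def distance_constraint(x, newx, x2, restlength, stiffness):
    """Pull newx toward (or away from) neighbour x2 to restore restlength."""
    delta = [b - a for a, b in zip(x, x2)]
    deltalength = math.sqrt(sum(d * d for d in delta))
    # Like the Cg version, this divides by zero if two particles coincide.
    diff = (deltalength - restlength) / deltalength
    return [n + d * stiffness * diff for n, d in zip(newx, delta)]


def constraint_pass(pos, restlength, stiffness):
    """pos[t][s] is a particle position; returns a new grid of positions."""
    h, w = len(pos), len(pos[0])
    out = []
    for t in range(h):
        row = []
        for s in range(w):
            x = pos[t][s]
            newx = list(x)
            for ds, dt in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                s2, t2 = s + ds, t + dt
                if 0 <= s2 < w and 0 <= t2 < h:  # skip neighbours off the grid
                    newx = distance_constraint(x, newx, pos[t2][s2],
                                               restlength, stiffness)
            row.append(newx)
        out.append(row)
    return out


# Two particles stretched to distance 2 with rest length 1.
relaxed = constraint_pass([[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]], 1.0, 0.5)
```

With a stiffness of 0.5 each particle moves half of the error, so a single stretched pair is restored to its rest length in one pass.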
Physical Simulation – Future Work
Limitation - only one destination buffer, can only
modify position of one particle at a time
Could use pack instructions to store 2 vec4h (8
half floats) in 128 bit float buffer
Could also use additional textures to encode
particle masses, stiffness, constraints between
arbitrary particles (rigid bodies)
“float buffer to vertex array” extension offers
possibility of directly interpreting results as
geometry without any CPU intervention!
Collision detection with meshes is hard
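The packing idea mentioned above (two 4-component half-float vectors in one 128-bit texel) can be illustrated on the CPU with Python's `struct` module, whose `e` format code is an IEEE half float. The helper names and the little-endian layout are assumptions for the sketch; the actual GPU pack instructions are not modelled here.

```python
import struct


def pack_two_vec4h(a, b):
    """Pack two 4-component vectors as 8 half floats (16 bytes = 128 bits)."""
    return struct.pack('<8e', *a, *b)


def unpack_two_vec4h(buf):
    """Inverse of pack_two_vec4h: returns two 4-tuples."""
    v = struct.unpack('<8e', buf)
    return v[:4], v[4:]


position_a = (0.5, 1.0, -2.0, 0.25)
position_b = (4.0, -0.125, 8.0, 1.5)
texel = pack_two_vec4h(position_a, position_b)
restored_a, restored_b = unpack_two_vec4h(texel)
```

The values used here are exactly representable in half precision; arbitrary positions would be rounded, which is the precision cost of storing two particles per texel.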
Demos Introduction
Developed 4 demos for the launch of GeForce FX
“Dawn”
“Toys”
“Time Machine”
“Ogre” (Spellcraft Studio)
Characters Look Better With Hair
Rendering Hair
Two options:
1) Volumetric (texture)
2) Geometric (lines)
We have used volumetric approximations (shells
and fins) in the past (e.g. Wolfman demo)
Doesn’t work well for long hair
We considered using textured ribbons (popular in
Japanese video games). Alpha sorting is a pain.
Performance of GeForce FX finally lets us render
hair as geometry
Rendering Hair as Lines
Each hair strand is rendered as a line strip (2-20
vertices, depending on curvature)
Problem: lines are a minimum of 1 pixel thick,
regardless of distance from camera
Not possible to change line width per vertex
Can use camera-facing triangle strips, but these
require twice the number of vertices, and have
aliasing problems
Anti-Aliasing
Two methods of anti-aliasing lines in OpenGL
GL_LINE_SMOOTH
High quality, but requires blending, sorting geometry
GL_MULTISAMPLE
Usually lower quality, but order independent
We used multisample anti-aliasing with “alpha to
coverage” mode
By fading alpha to zero at the ends of hairs,
coverage and apparent thickness decreases
“SAMPLE_ALPHA_TO_COVERAGE_ARB” is part
of the ARB_multisample extension
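To see why fading alpha thins a hair, here is a rough Python model of alpha to coverage. It is an assumption-laden sketch: real hardware chooses the sample mask in an implementation-dependent (often dithered) way, and the function names and the 4-sample default are invented for the example.

```python
def alpha_to_coverage(alpha, samples=4):
    """Turn a fragment alpha into a per-sample coverage mask.

    Simplified model: the first round(alpha * samples) samples are covered.
    """
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * samples + 0.5)
    return [i < covered for i in range(samples)]


def resolve(mask, hair_color, background):
    """Average the samples of one pixel, as the multisample resolve does."""
    n = sum(mask)
    total = len(mask)
    return tuple((n * h + (total - n) * b) / total
                 for h, b in zip(hair_color, background))


mask = alpha_to_coverage(0.5)
pixel = resolve(mask, (1.0, 1.0, 1.0), (0.0, 0.0, 0.0))
```

No blending or sorting is involved: each hair only decides which samples it owns, so the result does not depend on draw order.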
Hair Without Antialiasing
Hair With Multisample Antialiasing
Hair Shading
Hair is lit with simple anisotropic shader (Heidrich
and Seidel model)
Low specular exponent, dim highlight looks best
Black hair = no shadows!
Self-shadowing hair is hard
Deep shadow maps
Opacity shadow maps
Top of head is painted black to avoid skin showing
through
We also had a very short hair style, which helps
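As a reference for the anisotropic model named above, the sketch below evaluates the diffuse and specular terms from the hair tangent T, light direction L and view direction V. It follows the commonly cited tangent-based form of the Heidrich and Seidel model; the function names are hypothetical and the demo's actual shader (colours, attenuation, exponent) is not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    inv = 1.0 / math.sqrt(dot(v, v))
    return tuple(c * inv for c in v)


def anisotropic_hair(tangent, light, view, exponent):
    """Return (diffuse, specular) intensities for a hair strand.

    The normal is implied: it is the direction perpendicular to the
    tangent that lies in the plane of the tangent and the light.
    """
    t, l, v = normalize(tangent), normalize(light), normalize(view)
    lt, vt = dot(l, t), dot(v, t)
    sin_lt = math.sqrt(max(0.0, 1.0 - lt * lt))
    sin_vt = math.sqrt(max(0.0, 1.0 - vt * vt))
    diffuse = sin_lt                                  # L.N
    spec_base = max(0.0, sin_lt * sin_vt - lt * vt)   # V.R
    return diffuse, spec_base ** exponent


# Light and viewer both perpendicular to the strand: full highlight.
lit = anisotropic_hair((1, 0, 0), (0, 1, 0), (0, 1, 0), 8)
# Light shining along the strand: no diffuse, no highlight.
grazing = anisotropic_hair((1, 0, 0), (1, 0, 0), (0, 1, 0), 8)
```

A low exponent, as recommended on the slide, widens the highlight band along the strand.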
Hair Styling is Important
Hair Styling
Difficult to position 50,000 individual curves by
hand
Typical solution is to define a small number of
control hairs, which are then interpolated across
the surface to produce render hairs
We developed a custom tool for hair styling
Commercial hair applications have poor styling
tools and are not designed for real time output
Hair Styling
Scalp is defined as a polygon mesh
Hairs are represented as cubic Bezier curves
Control hairs are defined for each vertex
Render hairs are interpolated across triangles
using barycentric coordinates
Number of generated hairs is based on triangle
area to maintain constant density
Can add noise to interpolated hairs to add variation
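The interpolation scheme above can be sketched in Python as follows. The data layout (each control hair as a list of four Bezier control points) and the function names are assumptions for the example; the tool's real data structures are not described in the slides.

```python
def bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    w = (u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t)
    return tuple(w[0] * a + w[1] * b + w[2] * c + w[3] * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def interpolate_hair(h0, h1, h2, b0, b1, b2):
    """Blend three control hairs with barycentric weights (b0 + b1 + b2 = 1).

    Each hair is a list of 4 control points; the result is a render hair
    in the same form.
    """
    return [tuple(b0 * a + b1 * b + b2 * c for a, b, c in zip(p, q, r))
            for p, q, r in zip(h0, h1, h2)]


# Three straight control hairs rooted at the corners of a triangle.
hair_a = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.0, 3.0, 0.0)]
hair_b = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 2.0, 0.0), (1.0, 3.0, 0.0)]
hair_c = [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 2.0, 1.0), (0.0, 3.0, 1.0)]
render_hair = interpolate_hair(hair_a, hair_b, hair_c, 0.5, 0.25, 0.25)
midpoint = bezier(*render_hair, 0.5)
```

Per-hair noise could be added to the interpolated control points before evaluation to break up the regular look.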
Hair Styling Tool
Provides a simple UI for styling hair
Combing tools
Lengthen / shorten
Straighten / mess up
Uses a simple physics simulation based on Verlet
integration (Jakobsen, GDC 2001)
Physics is run on control hairs only
Collision detection done with ellipsoids
Dawn Demo
Show demo
The Ogre Demo
A real-time preview of Spellcraft Studio’s in-production short movie “Yeah!”
Created in 3DStudio MAX
Used Character Studio for animation, plus Stitch
plug-in for cloth simulation
Original movie was rendered in Brazil with global
illumination
Available at: www.yeahthemovie.de
Our aim was to recreate the original as closely as
possible, in real-time
What are Subdivision Surfaces?
A curved surface defined as the limit of repeated
subdivision steps on a polygonal model
Subdivision rules create new vertices, edges, faces
based on neighboring features
We used the Catmull-Clark subdivision scheme (as
used by Pixar)
MAX, Maya, Softimage, Lightwave all support
forms of subdivision surfaces
Realtime Adaptive Tessellation
Brute force subdivision is expensive
Generates lots of polygons where they aren’t needed
Number of polygons increases exponentially with
each subdivision
Adaptive tessellation subdivides patches based on
screen-space patch size test
Guaranteed crack-free
Generates normals and tangents on the fly
Culls off-screen and back-facing patches
CPU-based (uses SSE where possible)
Control Mesh vs. Subdivided Mesh
[Figure: control mesh (4000 faces) next to subdivided mesh (17,000 triangles)]
Control Mesh Detail
Subdivided Mesh Detail
Why Use Subdivision Surfaces?
Content
Characters were modeled with subdivision in mind
(using 3DSMax “MeshSmooth/NURMS” modifier)
Scalability
We wanted the demo to be scalable to lower-end hardware
“Infinite” detail
Can zoom in forever without seeing hard edges
Animation compression
Just store low-res control mesh for each frame
May be accelerated on future GPUs
Disadvantages of Realtime Subdivision
CPU intensive
But we might as well use the CPU for something!
View dependent
Requires re-tessellation for shadow map passes
Mesh topology changes from frame to frame
Makes motion blur difficult
Ambient Occlusion Shading
Helps simulate the global illumination “look” of the
original movie
Self occlusion is the degree to which an object
shadows itself
“How much of the sky can I see from this point?”
Simulates a large spherical light surrounding the
scene
Popular in production rendering – Pearl Harbor
(ILM), Stuart Little 2 (Sony)
Occlusion
[Figure: occlusion relative to the surface normal N]
How To Calculate Occlusion
Shoot rays from surface in random directions over
the hemisphere (centered around the normal)
The percentage of rays that hit something is the
occlusion amount
Can also keep track of average of un-occluded
directions – “bent normal”
Some RenderMan-compliant renderers (e.g. Entropy)
have a built-in occlusion() function that will do this
We can’t trace rays using graphics hardware (yet)
So we pre-calculate it!
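The ray-shooting estimate described above can be sketched in Python. This is not the baking tool: the scene is reduced to a list of spheres, the function names are hypothetical, and the fixed random seed is only there to make runs repeatable. The 128-ray default matches the ray count quoted for the baking tool.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_hemisphere(normal, rng):
    """Uniform random unit vector in the hemisphere around normal."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        l2 = dot(v, v)
        if 1e-6 < l2 <= 1.0:  # rejection-sample the unit ball
            break
    inv = 1.0 / math.sqrt(l2)
    v = [c * inv for c in v]
    return v if dot(v, normal) >= 0.0 else [-c for c in v]


def ray_hits_sphere(origin, direction, center, radius):
    """True if the ray hits the sphere (origin assumed outside it)."""
    oc = [c - o for o, c in zip(origin, center)]
    t = dot(oc, direction)  # distance along the ray to closest approach
    if t <= 0.0:
        return False
    return dot(oc, oc) - t * t <= radius * radius


def occlusion(point, normal, spheres, rays=128, seed=0):
    """Return (fraction of occluded rays, bent normal)."""
    rng = random.Random(seed)
    hits = 0
    bent = [0.0, 0.0, 0.0]
    for _ in range(rays):
        d = sample_hemisphere(normal, rng)
        if any(ray_hits_sphere(point, d, c, r) for c, r in spheres):
            hits += 1
        else:
            bent = [b + x for b, x in zip(bent, d)]  # sum unoccluded dirs
    length = math.sqrt(dot(bent, bent))
    bent = [b / length for b in bent] if length > 0.0 else list(normal)
    return hits / rays, bent


open_sky, open_bent = occlusion((0, 0, 0), (0, 1, 0), [])
shaded, shaded_bent = occlusion((0, 0, 0), (0, 1, 0), [((0, 2, 0), 1.0)])
```

The returned fraction is the occlusion amount; one minus it could serve as the ambient term, and the bent normal is the normalized average of the unoccluded directions.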
Occlusion Baking Tool
Uses ray-tracing engine to calculate occlusion
values for each vertex in control mesh
We used 128 rays / vertex
Stored as floating point scalar for each vertex and
each frame of the animation
Calculation took around 5 hours for 1000 frames
Subdivision code interpolates occlusion values
using cubic interpolation
Used as ambient term in shader
Ogre Demo
Show demo
Procedural Shading in Time Machine
Goals for the Time Machine demo
Overview of effects
Metallic Paint
Wood
Chrome
Techniques used
Faux-BRDF reflection
Reveal and dXdT maps
Normal and DuDv scaling
Dynamic Bump mapping
Performance Issues
Summary
Why do Time Machine?
GPUs are much more programmable
Thanks to generalized dependent texturing, more
active textures (16 on GeForce FX) and (for our
purposes) unlimited blend operations, high-quality
animation is possible per-pixel
GeForce FX has >2x the performance of GeForce4 Ti
Executing lots of per-pixel operations isn’t just
possible; it can be done in real time.
Previous per-pixel animation was limited
Animated textures
PDE / CA effects (see Mark Harris’ talk at GDC)
Goal : Full-scene per-pixel animation
Why do Time Machine? (continued)
Neglected pick-up trucks demonstrate a wide
variety of surface effects, with intricate transitions
and boundaries
Paint oxidizing, bleaching and rusting
Vinyl cracking
Wood splintering and fading
And more…
Not possible with just per-vertex animation!
Time Machine Effects : Paint
Paint textures:
Specular color shift
Bubbling
Oxidation
Rusting
60 Pixel Shader instructions, 11 textures
The 11 textures (* = artist created):
• Paint Color
• Rust LUT
• Shadow map
• Spotlight mask
• Light Rust Color*
• Deep Rust Color*
• Ambient Light*
• Bubble Height*
• Reveal Time*
• New Environment*
• Old Environment*
Effects (cont’d) : Wood, Chrome, Glass
Wood fades and cracks
31 instructions, 6 textures
Chrome welts and corrodes
23 instructions, 8 textures
Headlights fog
24 instructions, 4 textures
Procedural or Not?
Procedural shading normally replaces textures
with functions of several variables.
Time Machine uses textures liberally.
The only parameter to our shaders is time.
However, turning everything into math is expensive
Time Machine’s solution
Give artist direct control (textures) over final image,
use functions to control transitions
Techniques : Faux-BRDF Reflection
Many automotive paints exhibit a color-shift as a
function of the light and viewer directions.
This effect has been approximated with analytic
BRDFs (Lafortune’s cosine lobes)
And measured by Cornell University’s graphics lab
BRDF factorization [McCool, Rusinkiewicz] is one
method to use this data on graphics hardware
Efficient representation with multiple 2D textures
Closely approximates the original BRDFs
But not necessarily the most efficient method for
automotive paint, and not artist-controllable.
Reflection intensity is uninteresting (largely Blinn)
Rotated/projected axes hard to visualize
Techniques : Faux-BRDF Reflection 2
Our solution: project BRDF values onto a single 2D
texture, and factor out the intensity
Compute intensity in real-time, using (N.H)^s
Texture varies slowly, so it can be low-res (64x64).
Anti-aliasing texture fixes laser noise at grazing angles
For automotive paints, N.L and N.H work well for axes.
Not physically accurate, but fast and high-quality.
Easy for artists to tweak.
[Figures: DuPont Cayman lacquer; Mystique lacquer]
Techniques : Reveal and dXdT maps
Artists do not want to paint hundreds of frames of
animation for a surface transition (e.g., paint->rust)
Ultimately, effect is just a conditional:
if (time > n) color = rust; else color = paint;
Or an interpolation between a start and end point
paint = interpolate(paint, bleach, s*(time-n));
So all intermediate values can be generated.
For continuous effects, use dXdT (velocity) maps
Can be stored in alpha in a DXT5 texture.
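The reveal-map idea can be written out per pixel. In this Python sketch, `reveal_time` stands for the value fetched from the reveal map at a pixel (the slide's `n`) and `speed` for the slide's `s`; the function names and the clamp to [0, 1] are assumptions made so the interpolation has a well-defined start and end.

```python
def saturate(v):
    """Clamp to [0, 1]."""
    return min(max(v, 0.0), 1.0)


def reveal(start_color, end_color, reveal_time, time, speed):
    """Blend from start_color to end_color once time passes reveal_time.

    A very large speed approaches the hard conditional
    (time > n ? end : start); smaller speeds give a gradual transition.
    """
    w = saturate(speed * (time - reveal_time))
    return tuple(a + (b - a) * w for a, b in zip(start_color, end_color))


paint = (0.0, 0.0, 1.0)
rust = (1.0, 0.0, 0.0)
before = reveal(paint, rust, reveal_time=5.0, time=4.0, speed=2.0)
halfway = reveal(paint, rust, reveal_time=5.0, time=5.25, speed=2.0)
after = reveal(paint, rust, reveal_time=5.0, time=9.0, speed=2.0)
```

The artist paints only the reveal times; every intermediate frame falls out of the single time parameter.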
Performance Concerns
Executing large shaders is expensive.
First rule of optimization: Keep inner loops tight
Shaders are the inner loop, run >1M times per frame.
But graphics cards have many parallel units
Vertex, fragment, and texture units
Modern GPUs do a great job of hiding texture latency
Bandwidth is unimportant in long shaders
Time Machine runs at virtually the same framerate on a
500/500 GeForce FX as it does on a 500/400 or 500/550
So not using textures is wasting performance!
Performance Concerns…
What makes a good texture?
Saves math operations
8-bit (RGBA) or 16-bit (HILO) precision is sufficient
Depends on a limited number of variables
Textures we used
Interpolating between light and dark rust layers
Required computing the difference between light and
dark layers’ reveal maps, and expanding to [0..1].
Function was dependent on current and reveal time.
Used to blend two texture maps
Performance Concerns…
Textures Used, continued…
Surround Maps
Recomputing the normal requires knowing the heights of
4 texels (s-1,t), (s+1,t), (s,t+1) and (s,t-1)
Each height is only 1 8-bit component
Instead of 4 dependent fetches, we can pack all into 1
S(s,t) = [ H(s-1, t), H(s+1, t), H(s,t-1), H(s,t+1) ]
Saved 4 math ops and 3 texture fetches + shuffle logic
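A small Python sketch of the surround-map packing, under assumptions: the height map is a nested list, texture addressing wraps at the edges, and the normal is rebuilt with plain central differences. The function names are invented for the example.

```python
import math


def build_surround_map(height):
    """Precompute S(s,t) = [H(s-1,t), H(s+1,t), H(s,t-1), H(s,t+1)]."""
    h, w = len(height), len(height[0])

    def H(s, t):
        return height[t % h][s % w]  # wrap addressing (an assumption)

    return [[(H(s - 1, t), H(s + 1, t), H(s, t - 1), H(s, t + 1))
             for s in range(w)] for t in range(h)]


def normal_from_surround(texel, bump_scale=1.0):
    """Rebuild a unit normal from one fetch of the packed neighbours."""
    left, right, below, above = texel
    n = ((left - right) * bump_scale, (below - above) * bump_scale, 1.0)
    inv = 1.0 / math.sqrt(sum(c * c for c in n))
    return tuple(c * inv for c in n)


# A height ramp rising along s.
ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(3)]
surround = build_surround_map(ramp)
n_ramp = normal_from_surround(surround[1][1])
n_flat = normal_from_surround(build_surround_map([[0.5] * 4] * 4)[2][2])
```

The four neighbour heights arrive in the four channels of a single texel, so the shader only needs one fetch before the differences.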
Time Machine demo
Show demo
Toys Demo - Simple Depth of Field
Render scene to color and depth textures
Generate mipmaps for color texture
Render full screen quad with “simpledof” shader:
depth = tex(depthtex, texcoord)
coc (circle of confusion) = abs(depth*scale + bias)
color = txd(colortex, texcoord, (coc,0), (0,coc))
Scale and bias are derived from the camera:
scale =
(aperture * focaldistance * planeinfocus * (zfar - znear)) /
((planeinfocus - focaldistance) * znear * zfar)
bias =
(aperture * focaldistance * (znear - planeinfocus)) /
((planeinfocus - focaldistance) * znear)
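The scale and bias terms can be checked numerically against the thin-lens circle of confusion, aperture * focaldistance * (z - planeinfocus) / (z * (planeinfocus - focaldistance)). The Python sketch below assumes a standard perspective depth buffer with values in [0, 1]; that mapping, the function names and the camera numbers are assumptions. Substituting the depth mapping into the thin-lens formula gives a bias whose denominator is (planeinfocus - focaldistance) * znear, the same factor as in scale, and that is the form used here.

```python
def dof_scale_bias(aperture, focaldistance, planeinfocus, znear, zfar):
    """Linear map from a depth-buffer value to a signed circle of confusion."""
    scale = (aperture * focaldistance * planeinfocus * (zfar - znear)) / \
            ((planeinfocus - focaldistance) * znear * zfar)
    bias = (aperture * focaldistance * (znear - planeinfocus)) / \
           ((planeinfocus - focaldistance) * znear)
    return scale, bias


def depth_from_distance(z, znear, zfar):
    """Depth-buffer value in [0, 1] for eye distance z (assumed mapping)."""
    return zfar * (z - znear) / (z * (zfar - znear))


def thin_lens_coc(aperture, focaldistance, planeinfocus, z):
    """Reference circle of confusion from the thin-lens model."""
    return abs(aperture * focaldistance * (z - planeinfocus) /
               (z * (planeinfocus - focaldistance)))


aperture, focal, focus, znear, zfar = 0.1, 0.05, 5.0, 0.1, 100.0
scale, bias = dof_scale_bias(aperture, focal, focus, znear, zfar)
coc_in_focus = abs(depth_from_distance(focus, znear, zfar) * scale + bias)
coc_far = abs(depth_from_distance(10.0, znear, zfar) * scale + bias)
```

Objects on the plane in focus get a circle of confusion of zero, so the shader samples the sharpest mip level there and progressively blurrier levels away from it.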
Artifacts: Bilinear Interpolation/Magnification
Bilinear artifacts in extreme back- and near-ground
Solution: multiple jittered samples
Even without jittering, a 4 or 5 sample rotated grid
pattern brings smaller artifacts under control
Larger artifacts need jittered samples, and more of
them
Then it’s just a tradeoff between noise from the
jittering and bilinear interpolation artifacts
(and of course the quality/performance tradeoff with
number of samples)
Noise vs. Interpolation Artifacts
[Figure: the same frame rendered with noise and without noise]
Artifacts: Depth Discontinuities
Near-ground (blurry) pixels don’t properly blend
out over top of mid-ground (sharp) pixels
Easy solution: Cheat!
Either don’t let objects get too far in front of the
plane in focus, or blur everything a little more when
they do – soft edges help hide this fairly well.
Depth Discontinuities
Fun With Color Matrices
Since we’re already rendering to a full-screen
texture, it’s easy to muck with the final image.
Operations are just rotations / scales in RGB space
Color (hue) shift
Saturation
Brightness
Contrast
These are all matrices, so compose them together,
and apply them as 3 dot products in the shader
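Composing the colour operations into one matrix might look like the Python sketch below. The luminance weights (Rec. 601 values) and the function names are assumptions, and only saturation and brightness are shown; a hue shift would be one more 3x3 rotation multiplied into the same product.

```python
LUMA = (0.299, 0.587, 0.114)  # assumed luminance weights (Rec. 601)


def matmul(a, b):
    """3x3 matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def saturation_matrix(s):
    """s = 1 leaves colours unchanged, s = 0 gives greyscale."""
    return [[(1.0 - s) * LUMA[j] + (s if i == j else 0.0) for j in range(3)]
            for i in range(3)]


def brightness_matrix(b):
    """Uniform scale in RGB space."""
    return [[b if i == j else 0.0 for j in range(3)] for i in range(3)]


def apply(m, rgb):
    """Three dot products: one matrix row per output channel."""
    return tuple(sum(m[i][j] * rgb[j] for j in range(3)) for i in range(3))


# Desaturate first, then double the brightness, as a single matrix.
combined = matmul(brightness_matrix(2.0), saturation_matrix(0.0))
grey = apply(saturation_matrix(0.0), (1.0, 0.0, 0.0))
bright_grey = apply(combined, (1.0, 0.0, 0.0))
```

The composition happens once on the CPU; the shader only ever pays for the three dot products, however many operations were combined.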
Original Image
Colorshifted Image
Black and White Image
Toys Demo
Show demo
Order Independent Transparency
Why is correct transparency hard?
Depth peeling
Two depth buffers
Enter the shadow map
Precision/invariance issues
Depth replace texture shader
Blending the layers
Other applications
Can’t just glEnable(GL_BLEND)…
[Figure: good transparency with OIT next to bad transparency without OIT]
Why is correct transparency hard?
Most hardware does object-order rendering
Correct transparency requires sorted traversal
Have to render polygons in sorted order
Not very convenient
Polygons can’t intersect
A lot of extra application work
Especially difficult for dynamic scene databases
Depth Peeling
The algorithm uses an “implicit sort” to extract
multiple depth layers
First pass render finds front-most fragment
color/depth
Each successive pass render finds (extracts) the
fragment color/depth for the next-nearest fragment
on a per pixel basis
Use dual depth buffers to compare previous nearest
fragment with current
Second “depth buffer” used for comparison (read
only) from texture [more on this later]
[Figure: depth peeling results, Layers 0 through 3]
[Figure: cross-section view of depth peeling for Layers 0, 1, and 2, with depth running from 0 to 1]
Depth peeling strips away depth layers with each
successive pass. The frames above show the
frontmost (leftmost) surfaces as bold black lines,
hidden surfaces as thin black lines, and “peeled away”
surfaces as light grey lines.
Dual Depth Buffer Pseudo-code
for ( i = 0; i < num_passes; i++ )
{
    clear color buffer
    depth unit 0:
        if (i == 0) { disable depth test }
        else { enable depth test }
        bind depth buffer (i % 2)
        disable depth writes /* read-only depth test */
        set depth func to GREATER
    depth unit 1:
        bind depth buffer ((i+1) % 2)
        clear depth buffer
        enable depth writes
        enable depth test
        set depth func to LESS
    render scene
    save color buffer RGBA as layer i
}
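The effect of the two depth tests can be traced for a single pixel in Python. This is a CPU illustration only: `fragments` stands for all the fragments the scene produces at one pixel, in arbitrary order, and the function name is hypothetical. The first test (GREATER against the previous layer, read-only) rejects everything already peeled; the second (LESS, writable) keeps the nearest survivor.

```python
def peel_layers(fragments, num_passes):
    """fragments: list of (depth, rgba) for one pixel, in any order.

    Returns the layers in front-to-back order, one per pass.
    """
    layers = []
    prev_depth = None  # pass 0 has no previous layer: first test disabled
    for _ in range(num_passes):
        nearest = None  # "clear depth buffer" for the writable unit
        for depth, rgba in fragments:  # "render scene"
            # Depth unit 0: read-only test, func GREATER vs. previous layer.
            if prev_depth is not None and not depth > prev_depth:
                continue
            # Depth unit 1: ordinary writable test, func LESS.
            if nearest is None or depth < nearest[0]:
                nearest = (depth, rgba)
        if nearest is None:
            break  # nothing left to peel at this pixel
        layers.append(nearest)  # "save color buffer RGBA as layer i"
        prev_depth = nearest[0]
    return layers


unsorted = [(0.7, 'c'), (0.2, 'a'), (0.5, 'b')]
all_layers = peel_layers(unsorted, 4)
two_layers = peel_layers(unsorted, 2)
```

The fragments are never sorted explicitly; the ordering emerges from repeating the scene traversal with a moving depth threshold.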
Implementation
There is no “dual depth buffer” extension to
OpenGL, so what can we do?
Just need one depth test with writeable depth
buffer – the other can be read-only
Shadow mapping is a read-only depth test!
Depth test can have an arbitrary camera location
Other interesting uses for clip volumes
Fast copies make this proposition reasonable
Copies will be unnecessary in the future…
Precision / Invariance issues
Using shadow mapping hardware introduces
precision and invariance issues
Depth rasterization usually just needs to match
output depth buffer precision, and requires no
perspective correction
Texture hardware requires perspective correction
and projection at high precision
Making things match would be difficult without the
DEPTH_REPLACE texture shader
Computes with texture hardware at texture precision
Solves invariance problems at some extra expense
Will be cheaper in the future…
[Figure: the same scene composited from 1, 2, 3, and 4 layers]
Compositing
Each time we peel, we capture the RGBA, then as
a final step, we blend all the layers together from
back to front
Opaque fragments completely overwrite previous
transparent ones
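The final blend can be sketched per pixel in Python. It assumes non-premultiplied RGBA layers stored front to back and the usual "over" blend (source alpha, one minus source alpha); the function name is hypothetical.

```python
def composite_back_to_front(layers, background):
    """layers: front-to-back list of (r, g, b, a); returns the final RGB."""
    color = background
    for r, g, b, a in reversed(layers):  # start with the farthest layer
        color = tuple(a * s + (1.0 - a) * d
                      for s, d in zip((r, g, b), color))
    return color


black = (0.0, 0.0, 0.0)
# Half-transparent red in front of opaque green.
mixed = composite_back_to_front([(1.0, 0.0, 0.0, 0.5),
                                 (0.0, 1.0, 0.0, 1.0)], black)
# Opaque red in front of half-transparent green: red wins outright.
opaque_front = composite_back_to_front([(1.0, 0.0, 0.0, 1.0),
                                        (0.0, 1.0, 0.0, 0.5)], black)
```

An opaque layer has alpha 1, so it fully replaces whatever was accumulated behind it.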
Conclusions
Results are nice!
Get correct transparency without invasive changes
to internal data structures
Can be “bolted on” to existing CAD/CAM apps
Requires n scene traversals for n correctly sorted
depths
n = 4 is often quite satisfactory (see previous slide)
Shadow maps are for more than shadows!
Questions?
cem@nvidia.com
http://developer.nvidia.com
http://developer.nvidia.com/cg/
http://www.cgshaders.org/