Zen of multi core rendering
» Corrinne Yu
» Halo team Principal engine programmer
» Corrinne.Yu@microsoft.com
Zen of multi core rendering
» Take away
» Compilation and survey of effective rendering techniques for current generation multi core console hardware
Rendering equation
» Radiance leaving a point
» Integral of radiance in all direction
» Reflectance distribution function
» Light coming inward to surface position
» Visibility of light to surface position
Rendering equation
» Integral of radiance in all direction
» Reflectance distribution function
» Light coming inward to surface position
» Visibility of light to surface position
» Attenuation of inward light due to incident angle with surface normal
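The terms listed above can be assembled into one common form of the equation. This is a sketch under assumptions: the symbols (x for the surface position, omega_o and omega_i for outgoing and incoming directions, f_r, L_i, V, n) are notation chosen here rather than taken from the slides, and an emitted radiance term is left out because the slides do not list one.

```latex
L_o(x, \omega_o) = \int_{\Omega}
    f_r(x, \omega_i, \omega_o)\,
    L_i(x, \omega_i)\,
    V(x, \omega_i)\,
    (n \cdot \omega_i)\,
    \mathrm{d}\omega_i
```

Reading left to right: L_o is the radiance leaving the point, the integral runs over all directions, f_r is the reflectance distribution function, L_i is the light coming inward, V is the visibility of that light, and (n . omega_i) is the attenuation due to the incident angle with the surface normal.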
Compromise and cheats
» This is computed per surface element
» This is infeasibly expensive
» In the past, we made quality compromises throughout to make run time rendering possible
First generation
» 1 to 4 dynamic lights
» Simple point lights
» Lambertian
» Blinn-Phong approximation
» Pre-computed diffuse radiosity
» Shadow map optional
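As a rough illustration of how cheap the first generation model is, the Lambertian and Blinn-Phong terms for a single point light can be sketched as below. The function name `shade_point_light` and the default `shininess` value are assumptions made for this sketch, not something from the slides.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shade_point_light(normal, to_light, to_eye, shininess=32.0):
    """Return (diffuse, specular) scalar terms for one point light."""
    n = normalize(normal)
    l = normalize(to_light)
    v = normalize(to_eye)
    # Lambertian diffuse: cosine of the angle between normal and light.
    diffuse = max(0.0, dot(n, l))
    # Blinn-Phong specular: based on the half vector between light and eye.
    # (Assumes the light and eye directions are not exactly opposite.)
    h = normalize(tuple(a + b for a, b in zip(l, v)))
    specular = max(0.0, dot(n, h)) ** shininess if diffuse > 0.0 else 0.0
    return diffuse, specular
```

With 1 to 4 lights this is a handful of dot products per pixel, which is why it fit the hardware budget of the time.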
Hardware
» 117 million triangles per second
» 0.933 gigapixels per second
» 1.86 giga texels per second
» 6.4 gigabytes of bandwidth per second
» 64 megabytes of video memory
Second generation
» 500 million triangles per second
» 4 gigapixels per second
» 8 giga texels per second
» 256 gigabytes of bandwidth per second
» 512 megabytes of video memory
Second generation
» 4.27x triangle throughput
» 4.29x pixel fill rate
» 4.29x texel rate
» 40x bandwidth
» 8x video memory
Second generation
» Large number of lights through precomputed radiance transfer
» Environment and area lights
» Realistic reflectance models
» Cook Torrance, Ward
» Shadow map
Large lights integral
» Large number of lights integral
» Static geometry
» Precomputed visibility
» Spatially non-varying BRDF's
» Low-frequency illumination
» Image-space resolution limited
Multi core generation
» 70x triangle throughput
» 450x pixel fill rate
» 390x texel rate
» 110x bandwidth
» 16x video memory
Amdahl’s law
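For reference, Amdahl's law in its usual form can be written as a one-line function. The names `amdahl_speedup`, `parallel_fraction`, and `cores` are chosen for this sketch.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

if __name__ == "__main__":
    # Even with 95% of the work parallel, the speedup stays below 1 / 0.05 = 20.
    for cores in (1, 4, 16, 256):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

The serial fraction sets a hard ceiling of 1 / (1 - parallel_fraction), no matter how many cores are added.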
Multi core insight
» Fill rate is achieved by completely asynchronous out of order VPU (Vector Processing Unit) computation
» My experience with CUDA is that there are intentionally no synchronization primitives
Multi core insight
» On Larrabee, each core has 4 hardware threads
» Each thread is out of order
» But for one thread’s execution, the vertices and pixels are synchronized
Multi core insight
» So there are essentially 256 out of order processes
» Each consisting of a batch of about 16 synchronized pixels or vertices in flight at any one time
Multi core insight
» Expectation is shader flops will grow the most
» Speed not from higher clock rate
» Speed from larger number of low power cores
» Memory is not expected to catch up to shader flops
Multi core insight
» ALU's or VPU's to increase by 300x
» Future is tfetch bound, not ALU bound
» Homogeneous computing
» Keep ALU's or VPU's very busy with cache coherent local data
Multi core generation
» Occlusion from static geometry
» Precomputed visibility
» Spatially non-varying BRDF's
» Low-frequency illumination
» Image-space resolution limited
Multi core generation
» Occlusion from dynamic geometry
» Precomputed visibility
» Spatially non-varying BRDF's
» Low-frequency illumination
» Image-space resolution limited
Multi core generation
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially non-varying BRDF's
» Low-frequency illumination
» Image-space resolution limited
Multi core generation
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially varying BRDF's
» Low-frequency illumination
» Image-space resolution limited
Multi core generation
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially varying BRDF's
» High-frequency illumination
» Image-space resolution limited
Multi core generation
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially varying BRDF's
» High-frequency illumination
» High quality resolution
Multi core generation
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially varying BRDF's
» High-frequency illumination
» High quality resolution
» Remove remaining compromises
Practical techniques
» Directional light map basis
» Zonal harmonics
» Screen space ambient occlusion
» Shadow map
Directional light map
» Proposed by Valve's G McTaggart for Half Life
» Used in many games like Half Life and Unreal
Directional light map
» Spatial axial basis
» (- 1 / sqrt(6), - 1 / sqrt(2), 1 / sqrt(3) )
» ( - 1 / sqrt(6), 1 / sqrt(2), 1 / sqrt(3) )
» ( sqrt(2 / 3), 0, 1 / sqrt(3) )
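A minimal sketch of how a per pixel normal could select between the three light map samples stored along this basis. The blend rule used here (clamped dot products, renormalized to sum to 1) is an assumption for illustration; shipped implementations vary in how they weight the three samples.

```python
import math

# The three basis directions listed above (tangent space, z along the normal).
BASIS = (
    (-1 / math.sqrt(6), -1 / math.sqrt(2), 1 / math.sqrt(3)),
    (-1 / math.sqrt(6),  1 / math.sqrt(2), 1 / math.sqrt(3)),
    (math.sqrt(2 / 3),   0.0,              1 / math.sqrt(3)),
)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def directional_lightmap(normal, lightmap_values):
    """Blend three per-basis light map values by a tangent-space normal.

    Assumption: clamped dot-product weights, renormalized to sum to 1.
    """
    weights = [max(0.0, dot(normal, b)) for b in BASIS]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * v for w, v in zip(weights, lightmap_values)) / total
```

The basis is orthonormal, so a normal aligned with one basis direction reads back exactly that direction's sample, while an unperturbed normal (0, 0, 1) averages all three.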
Analysis
» Static radiance can interact with directional changes of reflectance surface
» Per pixel normal reflectance of radiosity
» Per pixel normal specularity
Analysis
» Basis and precision are not uniformly distributed
» Radiance is correct at exactly 3 clamped directions
» Radiance undersampling occurs for wide ranges of directions
» Only for hemisphere
Pre-computed radiance transfer
» Zonal harmonics
» R Ramamoorthi and P Hanrahan came up with an efficient representation for irradiance environment maps
Irradiance environment map
» Only the first orders of zonal harmonics, up to order 2 (l = 0, 1, 2)
» Only use 9 terms
» Average error of only 1% against ray tracing
» Much less error prone than directional light maps
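The 9 term evaluation can be sketched with the quadratic polynomial from Ramamoorthi and Hanrahan's irradiance environment map work. The dictionary layout for the coefficients, keyed by (l, m), is an assumption made for this sketch, and each color channel would be evaluated separately.

```python
# Constants from Ramamoorthi and Hanrahan's irradiance formula.
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708

def irradiance(normal, L):
    """Evaluate irradiance for a unit normal from 9 lighting coefficients.

    `L` maps (l, m) to a scalar coefficient; this dict layout is an
    assumption made for the sketch.
    """
    x, y, z = normal
    return (
        C1 * L[(2, 2)] * (x * x - y * y)
        + C3 * L[(2, 0)] * z * z
        + C4 * L[(0, 0)]
        - C5 * L[(2, 0)]
        + 2.0 * C1 * (L[(2, -2)] * x * y + L[(2, 1)] * x * z + L[(2, -1)] * y * z)
        + 2.0 * C2 * (L[(1, 1)] * x + L[(1, -1)] * y + L[(1, 0)] * z)
    )
```

The whole lookup is a few multiply-adds per pixel on 9 constants, which is why it is feasible on current hardware.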
Analysis
» Completely feasible in current hardware
» Better than directional light maps
» Only the lowest of frequencies
» Incapable of representing dynamic local lights
Screen space ambient occlusion
» Developed by V Kajalin
» Used first in Crysis
» Used by games like Crysis and Unreal
» Sample depth difference between screen space neighbors as occlusion factor
Optimization
» Too many samples in reality
» In practice read small number of samples from a randomly rotated kernel
» Results are filtered to reduce noise
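The steps above can be sketched on the CPU as follows. This is a simplified illustration, not the shipped Crysis shader: the 4 tap kernel, the hash that stands in for a random rotation, the `bias` value, and the 3x3 box filter are all assumptions made here.

```python
# A small 2D kernel; rotations by 90 degree steps stand in for random rotation.
BASE_KERNEL = [(1, 0), (2, 1), (0, 2), (-1, 1)]

def rotate90(offsets, turns):
    for _ in range(turns % 4):
        offsets = [(-dy, dx) for dx, dy in offsets]
    return offsets

def ssao(depth, bias=0.01):
    """Return per-pixel occlusion in [0, 1] from a 2D depth buffer (list of rows)."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Per-pixel rotation; a hash of (x, y) stands in for a random angle.
            kernel = rotate90(BASE_KERNEL, (x * 3 + y * 5) % 4)
            occluded = 0
            for dx, dy in kernel:
                sx = min(max(x + dx, 0), w - 1)  # clamp to the buffer edge
                sy = min(max(y + dy, 0), h - 1)
                # A neighbor closer to the camera than this pixel occludes it.
                if depth[sy][sx] < depth[y][x] - bias:
                    occluded += 1
            out[y][x] = occluded / len(kernel)
    return out

def box_blur(img):
    """3x3 box filter to reduce the noise from the rotated kernel."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            taps = [img[min(max(y + j, 0), h - 1)][min(max(x + i, 0), w - 1)]
                    for j in (-1, 0, 1) for i in (-1, 0, 1)]
            out[y][x] = sum(taps) / 9.0
    return out
```

A flat depth buffer produces no occlusion, while a pixel that sits deeper than all of its sampled neighbors is fully occluded.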
Analysis
» Too many samples in reality
» In practice read small number of samples from a randomly rotated kernel
» Results are filtered to reduce noise
» A low number of samples leads to a low impact visual effect
Shadow map
» Xbox 360 has several hardware bilinear weight fetch instructions
» Performance boosters
» Use it for hardware accelerated percentage closer filtering
» getWeights1D, getWeights2D, getWeights3D, getWeightsCube
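What the weight fetch buys is easiest to see in a software version of percentage closer filtering. The sketch below is a CPU analogue only and does not call or model the Xbox 360 instructions; the function name `pcf_bilinear`, the texel coordinate convention, and the `bias` value are assumptions.

```python
import math

def pcf_bilinear(shadow_map, u, v, receiver_depth, bias=0.001):
    """Percentage closer filtering over the 2x2 texel footprint at (u, v).

    `shadow_map` is a list of rows of depths; (u, v) are in texel units
    with texel centers at integer coordinates (an assumption of this sketch).
    Returns the lit fraction in [0, 1].
    """
    h, w = len(shadow_map), len(shadow_map[0])
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0  # bilinear weights, the role the weight fetch plays
    lit = 0.0
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            x = min(max(x0 + dx, 0), w - 1)
            y = min(max(y0 + dy, 0), h - 1)
            # Compare first, then filter: the depth test result is what gets blended.
            passed = 1.0 if receiver_depth - bias <= shadow_map[y][x] else 0.0
            lit += wx * wy * passed
    return lit
```

The key point is the order of operations: the depth comparison happens per texel and the binary results are blended, rather than blending depths and comparing once.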
Shadow map
» Poisson filter with rotating kernel is shipped in many games, including Fable 2, Brothers in Arms, and so on
Poisson distribution
Poisson filter
» Generate random numbers with this distribution
» Rotate them
» Offset source sample by the jitters
» Render weighted accumulation
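The four steps above can be sketched as follows. Assumptions in this sketch: the samples are a Poisson disk set built by simple dart throwing, all taps get equal weight, and the names `poisson_disk`, `rotate`, and `soft_shadow` plus the `radius` and `bias` defaults are illustrative only.

```python
import math
import random

def poisson_disk(count, min_dist, seed=1, max_tries=10000):
    """Dart-throwing Poisson disk samples inside the unit disk."""
    rng = random.Random(seed)
    samples = []
    tries = 0
    while len(samples) < count and tries < max_tries:
        tries += 1
        p = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if p[0] ** 2 + p[1] ** 2 > 1.0:
            continue  # outside the unit disk
        if all(math.dist(p, q) >= min_dist for q in samples):
            samples.append(p)
    return samples

def rotate(samples, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in samples]

def soft_shadow(shadow_map, u, v, receiver_depth, samples, angle,
                radius=1.5, bias=0.001):
    """Average the depth test over jittered taps around (u, v), in texel units."""
    h, w = len(shadow_map), len(shadow_map[0])
    lit = 0.0
    for dx, dy in rotate(samples, angle):
        x = min(max(int(round(u + dx * radius)), 0), w - 1)
        y = min(max(int(round(v + dy * radius)), 0), h - 1)
        lit += 1.0 if receiver_depth - bias <= shadow_map[y][x] else 0.0
    return lit / len(samples)
```

In a shader the rotation angle would typically vary per pixel so that the banding from a small tap count turns into noise, which the filter then smooths.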
Analysis
» Shadow map itself has no soft edge
» Soft shadow map is created from jitters and filters
» Shadow map is an image based technique of finite resolution
Analysis
» Still a fast technique for high frequency local lighting
» 10000 spherical harmonics terms will not give you the occlusion a shadow map will give you
» Still useful for a very long time
Multi core generation
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially varying BRDF's
» High-frequency illumination
» High quality resolution
» Remove remaining compromises
Dynamic radiance
» Haar wavelet radiance caches
» Radiance transfer factorization
» Dimensionality reduction
» Linear discriminant analysis
» BRDF factorization
Dynamic radiance
» [Diagram labels: linear discriminant analysis, BRDF factorization, factorized BRDF, factorized radiance caches, radiance factorization, wavelet radiance caches, distance cube (or hemi cube), rasterization]
Wavelet radiance caches
» Haar wavelet basis
» Visibility
» Radiance factorization
Haar wavelet basis
» Spherical harmonics is not the only basis available for radiance transfer
» Radiance and sum of area lights can also be represented by Haar wavelets
Haar wavelet radiance
» What is exciting about Haar wavelet is that its radiance visibility triple integral is fast enough to run on GPU in real time
Haar wavelet
2D Haar wavelet and visibility
» The visibility function V(x, theta) is also a binary function
» Multiplying visibility with wavelet radiance spatially and physically turns parts of the wavelet equation on and off
Wavelet and integrals
» The integral of the product of wavelet radiance and visibility also simplifies the run-time equation
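A small 1D sketch of why the integral simplifies: in an orthonormal Haar basis, the integral of radiance times visibility becomes a dot product of coefficient vectors, and zero coefficients can simply be skipped. The 8 sample signals are made-up illustration data, and only the two-term product is shown; the triple product with a BRDF is more involved and not covered here.

```python
import math

def haar(signal):
    """Orthonormal 1D Haar transform; len(signal) must be a power of two."""
    out = list(signal)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + dif
        n = half
    return out

def sparse_dot(a, b, eps=1e-12):
    """Dot product that skips near-zero coefficients, as a sparse store would."""
    return sum(x * y for x, y in zip(a, b) if abs(x) > eps and abs(y) > eps)

# Hypothetical data: incoming light per direction and a binary occlusion mask.
radiance = [4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 0.0, 0.0]
visibility = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]

direct = sum(l * v for l, v in zip(radiance, visibility))
wavelet = sparse_dot(haar(radiance), haar(visibility))
```

Here `direct` and `wavelet` agree up to floating point error. Piecewise constant signals such as area lights and binary visibility leave many Haar coefficients at zero, which is what makes the sparse form attractive.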
Wavelet visibility insights
» In some ways, spherical harmonics is the frequency corrected distribution of the basis in directional light map
» Zonal harmonics correctly samples and stores radiance contribution without a preference to a direction
Wavelet visibility insights
» “Simulating soft shadows with graphics hardware”, Heckbert and Herf, 1997
» Heckbert rendered soft shadows by rendering shadows from 100 lights to create shadow penumbra
Analysis
» No BRDF and inter-reflection
» No radiance transfer
» No specular reflectance
» It was GPU accelerated for its time!
Multi core rendering
» What is the modern multi core shader / homogeneous function pipeline version of this technique?
Multi core rendering
» Not just shadows, the full radiance illumination model
» Not one light per pass, sample sparse wavelet data efficiently in tfetchCube
Wavelet radiance caches
» Haar wavelet basis
» Visibility
» Radiance factorization
Dynamic radiance
» For dynamic geometry, convolution of the visibility changes with the radiance wavelet coefficients must be performed before the radiance is applied
» Still challenging to perform at run time
Ray tracing or radiosity
» Capture only occlusion
» Capture the full transport and full reflectance distribution
» GPU occlusion through rasterization
» GPU kd-tree line trace
Capture only occlusion
» Feasible with current hardware
» Fast
» GPU side, hardware occlusion
» CPU side, line trace into kd-tree
» Visually unsophisticated
Capture full reflectance transport
» Visually much more complex than GPU occlusion
» More expensive
» Fill out wavelet probes on different threads across multiple frames
» Unfinished wavelet probes still useful for radiance
Radiosity
» The hemi-cube: a radiosity solution for complex environments. Cohen and Greenberg 1985
» Use GPU to rasterize radiance
Radiosity
» Great for low frequency spherical harmonics
» First pass has direct lighting only
» For high frequency wavelets, needs excessively high resolution
» No caustics, subsurface scattering
Radiosity
» Low resolution first pass with GPU hemi-cube
» Higher frequency passes with direction cube kd-tree line tracing
Raytracing
» Direction cube techniques and ray tracer caches can take up too much memory
» Reyes ray tracing may be more parallelizable, but be careful of bucket load balancing
Bounding volume hierarchy
» Kd-tree can be 15x faster than BSP for ray tracing
» SAH (surface area heuristic) only necessary in deeper nodes
» For nodes close to root, dividing by the number of objects in boxes is good enough
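The two split strategies can be sketched side by side. The cost model below is the commonly used form of the surface area heuristic; the function names and the unit `c_traverse` and `c_intersect` costs are assumptions for this sketch.

```python
def surface_area(box):
    """Surface area of an axis-aligned box ((minx, miny, minz), (maxx, maxy, maxz))."""
    (x0, y0, z0), (x1, y1, z1) = box
    dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def sah_cost(parent, left, right, n_left, n_right,
             c_traverse=1.0, c_intersect=1.0):
    """Expected cost of a split: traversal plus hit-probability-weighted child work."""
    area = surface_area(parent)
    p_left = surface_area(left) / area    # chance a ray through parent hits left
    p_right = surface_area(right) / area  # chance a ray through parent hits right
    return c_traverse + c_intersect * (p_left * n_left + p_right * n_right)

def median_split_index(object_count):
    """Cheap split near the root: just halve the object list."""
    return object_count // 2
```

A builder following the slide would use `median_split_index` for the top few levels and switch to minimizing `sah_cost` over candidate planes only in deeper nodes, where the choice of plane matters more.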
Wavelet radiance analysis
» It takes about 18 to 20 terms to represent all frequencies well
» This is twice the number of terms for SH irradiance maps (9 terms)
Wavelet radiance analysis
» Memory is much less because the probes are not pre-computed across the level
» Fetching the terms to synthesize the radiance is twice or more the pixel ALU cost
Wavelet radiance analysis
» 18 wavelet terms, on the other hand, capture high frequency quality not captured by 10000 term spherical harmonics
» Not exactly a 1:1 trade-off for high frequency or all frequency solution
Wavelet radiance caches
» Haar wavelet basis
» Visibility
» Radiance factorization
Radiance factorization
» Radiance factorization is important to dynamic radiance transfer
» Decompose radiance transfer
Radiance factorization
» Spatial contribution
» Angular contribution
» Temporal contribution
» Visibility contribution
Dimensionality reduction
» Exponential growth with dimensionality and contribution factors
» Dimensionality reduction to factorize the radiance triple integral
Dimensionality reduction
» In reality, the top factors impact output more than less relevant factors
Dimensionality reduction
» Principal components analysis
» Linear discriminant analysis
Principal components
» Principal
» Orthogonal linear combinations with the largest variance
» Secondary
» Linear combination with the second largest variance and orthogonal to principal
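A minimal dependency-free sketch of finding the principal and secondary components. It uses power iteration with deflation on the covariance matrix, which is one of several ways to compute them and is chosen here only for brevity; the function names are illustrative.

```python
import math

def covariance(points):
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    centered = [[p[i] - mean[i] for i in range(d)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenvector(m, iterations=200):
    """Power iteration: converges to the direction of largest variance."""
    v = [1.0] + [0.5] * (len(m) - 1)
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def principal_components(points, count=2):
    """Return `count` orthogonal directions, largest variance first."""
    cov = covariance(points)
    components = []
    for _ in range(count):
        v = top_eigenvector(cov)
        lam = sum(a * b for a, b in zip(matvec(cov, v), v))
        components.append(v)
        # Deflate: remove the found direction so the next pass finds the runner-up.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return components
```

For points scattered along the line y = x, the first component comes out close to the diagonal direction and the second is orthogonal to it.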
Principal components
» Use principal components to select important factors in the original radiance equation
» Keep separating until factors are separated into components
» Equation factored out into dynamic factors
Principal components
» We can see how factoring principal components can factor out the primary impact of dynamic variables in the radiance equation
Principal component
» PCA remaps an apparently complex function into feature or factor separable distribution
Principal components
Dimensionality reduction
» PCA works best with purely orthogonal data
» Unfortunately, radiance transfer is not very orthogonal at all
» For better results, a dimensionality reduction algorithm should find separation even when there is none
Linear discriminant analysis
» Works best for Gaussian distribution clusters
» Finds separation even when there is (almost) none
» LDA has potential to out-perform PCA in factorization of the rendering triple integral
Linear discriminant analysis
» Same idea as PCA
» Maximize separation by classification
» Minimize variance within the classification after projection
» Principal, secondary, …
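The two criteria above (maximize separation between classes, minimize variance within them) have a closed form for two classes, Fisher's discriminant. The sketch below handles only two 2D classes, and the sample points are made up for illustration; applying this to radiance data is not shown.

```python
def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 within-class scatter matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = (p[0] - m[0], p[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (mean_b - mean_a) for two 2D classes."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = (mb[0] - ma[0], mb[1] - ma[1])
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Hypothetical data: two elongated clusters that differ only by a small offset.
class_a = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]
class_b = [(0.0, 1.0), (1.0, 2.1), (2.0, 2.9), (3.0, 4.0)]
w = fisher_direction(class_a, class_b)
```

Both clusters have their largest variance along the diagonal, so projecting onto that direction interleaves them. Projecting onto `w` instead places every point of `class_a` below every point of `class_b`, which is the sense in which LDA finds separation where a variance-based method does not.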
D* for rendering?
» B Guenter at MSR
» Developed a compiler and declarative meta language D*
» Creates optimized source code
» Solve for dynamics of an equivalent system and no constraints
D* for rendering?
» With fewer degrees of freedom
» Uses analytic / symbolic approaches based on Lagrangian dynamics
» Coordinate reduction and projection
D* for rendering?
» Derive optional equations to solve for forward dynamics of the system
» Necessary derivatives to linearize the system’s equations of motion at any given configuration
D* for analytical models
» Is there potential for D* to reduce dimension symbolically for the render equation?
Factorization technology
» LDA and D* can be applied to factorize the triple integral
» Factorization is essential to dynamic radiance
Dynamic radiance
» Haar wavelet radiance caches
» Radiance transfer factorization
» Dimensionality reduction
» Linear discriminant analysis
» BRDF factorization
Dynamic radiance
BRDF factorization
Dynamic scenes
» Before light reaches the eye, light undergoes a huge number of physical interactions with many objects
» When these objects deform, animate, move, change, or get destroyed, the reflectance distribution should update accordingly
Dynamic radiance
» Factored dynamic radiance requires BRDF cooperation
» Factored spatial radiance transfer, factored specular radiance transfer, needs to be evaluated with only the BRDF lobes that are affected
BRDF factorization
» Efficiency and compression
» Specular lobes require higher order basis for fidelity
» Factorization keeps the basis cost down
BRDFs
» Cook Torrance
» Oren Nayar
» Ward
» Linear combination of measured BRDFs
BRDF factorization prior work
» BRDF factorization
» “Interactive relighting with dynamic BRDFs” MSRA: Sun, Zhou, Chen, Lin, Shi, Guo 2007
» They used PCA, not LDA.
» I learned good BRDF factorization practices from this paper.
Factorization
» The challenge of dynamic scenes is that, given a static world, the radiance inter-reflectance is determined by the configuration of the objects
» We need factorization that takes deformation into account
Haar and factorization
» Another reason I became interested in Haar wavelet representation of radiance is that it adapts very well with factorized tensors generated by LDA
Summary
» Occlusion from dynamic geometry
» Dynamic visibility computation
» Spatially varying BRDF's
» High-frequency illumination
» High quality resolution
» Remove remaining compromises
Long tail Xbox 360
» Use LDA at build time to reduce dimensionality
» Combine classifications
» Reduce number of run time variables to principal components
» Speed optimization
Future work
» Spherical wavelet instead of 2D Haar wavelet?
» Nonlinear and kernel dimensionality reduction instead of LDA?
» Dimensionality reduction on a symbolic level?
Summary
» Rally effort to develop symbolic kernels for dynamic radiance transfer
» Rally effort to factorize the rendering equation triple integral with mathematical techniques or manual optimization
Thank you
» Corrinne.Yu@microsoft.com
» Continue our discussion and future work to implement dynamic radiance at corrinnesdotplan.blogspot.com
» Please fill in the survey.