

Neural Volume Rendering for Novel View Synthesis
Candidate: Zimin Ran
zimin.ran@student.uts.edu.au
Student ID: 14309835
Supervisor: Linchao Zhu
August 2022
Candidature Assessment Report Stage 1
• Abstract
• Background
• Research question and its contribution to knowledge
• Research objectives and scope
• Comprehensive literature review
• Proposed research methodology and justification
• Ethics and risk consideration. You are required to demonstrate an awareness and understanding of social, ethical and environmental implications via the Postgraduate Students' Research Information Form including an Ethics Committee Submission*, if appropriate.
• Research plan. Show (1) a timeline of the steps you will take to ensure you complete your research program in the prescribed time; and (2) any resource requirements for the research.
• Progress to date
• References
1 Abstract
The synthesis of photorealistic imagery has been a central challenge of computer graphics research for decades. Typically, images are synthesized with traditional rendering algorithms such as ray tracing, which take precisely defined representations of scene geometry and material properties as input. Traditionally, scene representations are triangle meshes with associated textures, point clouds, volumetric grids, or implicit surface functions, which are produced by human artists, depth sensors, CT scans, and truncated signed distance fields, respectively. Inverse rendering is the process of reconstructing a scene representation from observations by minimizing differentiable rendering losses. Neural rendering combines traditional computer graphics with deep learning to render photo-realistic imagery from real-world observations. The main advantage of neural rendering is that it is designed to be 3D-consistent, enabling novel view synthesis of a captured scene. Beyond handling static real-world scenes, neural rendering is also capable of modelling deforming objects and of editing and compositing scenes. Most approaches are scene-specific, while techniques that generalize across object classes can be used for generation tasks.
2 Introduction
Neural volume rendering is an active research field that can generate photo-realistic imagery. Neural networks achieve impressive results in estimating the density and color of scene objects, and one of the sub-problems in neural volume rendering is camera pose estimation. Solving camera pose estimation benefits many industrial applications: it allows casually captured images, without any known focal length or intrinsic and extrinsic camera parameters, to be used as input for rendering imagery from unseen viewpoints.
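To make the relationship between the estimated densities, colors, and the final image concrete, the sketch below implements the standard volume rendering quadrature used by NeRF-style methods. It is a minimal PyTorch sketch; the function name, sample count, and tensor shapes are my own assumptions rather than part of this report.

```python
import torch

def composite_ray(densities, colors, deltas):
    """Standard NeRF-style volume rendering quadrature for one ray.

    densities: (N,)  non-negative density sigma_i at each sample along the ray
    colors:    (N,3) RGB color c_i predicted at each sample
    deltas:    (N,)  distance between consecutive samples
    Returns the composited RGB color of the ray.
    """
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - torch.exp(-densities * deltas)
    # Transmittance T_i: probability the ray reaches sample i unoccluded
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])  # shift so T_1 = 1
    weights = trans * alphas                        # per-sample contribution
    return (weights[:, None] * colors).sum(dim=0)   # (3,) pixel color

# Toy usage: 64 samples along one ray
sigma = torch.rand(64)
rgb = torch.rand(64, 3)
delta = torch.full((64,), 0.05)
pixel = composite_ray(sigma, rgb, delta)
```

Because this compositing step is fully differentiable, photometric errors on the output pixel can be back-propagated to the density and color predictions, which is what makes joint optimization of scene and camera parameters possible.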
My research focuses on camera pose estimation in neural rendering, trained on casually captured mobile phone photos. I outline my research and present the results in the following sections.
Research questions:
1. How can camera poses be estimated with a neural network?
2. How can an image be rendered and the camera pose estimated at the same time?
3. How can camera poses be estimated from mobile phone photographs without any rotation, translation, or focal-length information?
Research objectives:
1. Design a neural network that estimates the focal length and camera poses from phone photos without any ground-truth camera pose information.
2. Propose a model that directly estimates the camera pose from an image.
3. Build a composite model that estimates camera poses and renders photo-realistic images from unseen viewpoints simultaneously.
Research contributions:
1. Design a focal net and a pose net to estimate camera pose information.
2. Remove the pre-computation of camera parameters required by classical NeRF.
3. Estimate the camera pose directly from an image with an inversion net.
Neural Volume Rendering for Novel View Synthesis
1 Introduction
The generation of controllable photo-realistic images and videos is a long-standing challenge in computer graphics. The metaverse is a popular concept closely related to virtual reality, and the development of virtual and augmented reality relies on computer graphics, which therefore has huge potential in virtual reality applications. Traditionally, techniques for synthesizing photo-realistic imagery are based on the laws of physics: they take physical parameters as input and simulate light transport to render photo-realistic imagery. To perform ray tracing, all the physical properties need to be provided, including the material properties (e.g., opacity and reflectivity), the scene geometry, the camera intrinsics (e.g., focal length), the camera extrinsics (e.g., rotation and translation), and the illumination. Besides the physics-based approach, mathematically based methods require fewer parameters.
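As an illustration of the role the camera intrinsics and extrinsics play, the sketch below generates one ray per pixel from a focal length and a camera-to-world pose, in the way NeRF-style renderers typically do. This is a minimal PyTorch sketch under a pinhole camera model; the function and variable names are my own assumptions.

```python
import torch

def get_rays(H, W, focal, c2w):
    """Generate one ray per pixel for a pinhole camera.

    H, W:   image height and width in pixels
    focal:  focal length in pixels (camera intrinsic)
    c2w:    (3, 4) camera-to-world matrix [R | t] (camera extrinsics)
    Returns ray origins and directions, each of shape (H, W, 3).
    """
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    # Directions in the camera frame; the camera looks down the -z axis,
    # following the OpenGL-style convention used in the original NeRF code
    dirs = torch.stack([(i - W * 0.5) / focal,
                        -(j - H * 0.5) / focal,
                        -torch.ones_like(i)], dim=-1)
    # Rotate into the world frame and broadcast the camera origin
    rays_d = dirs @ c2w[:3, :3].T             # (H, W, 3)
    rays_o = c2w[:3, 3].expand(rays_d.shape)  # (H, W, 3)
    return rays_o, rays_d

# Toy usage: a 4x6 image with focal length 100 and an identity pose
pose = torch.eye(4)[:3]
origins, directions = get_rays(4, 6, 100.0, pose)
```

These rays are exactly what an incorrect pose or focal length corrupts, which is why errors in camera parameter estimation translate directly into rendering errors.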
Using these classical approaches to render imagery of real-world scenes requires estimating the physical parameters from images, an extremely challenging task known as inverse rendering. Recently, neural rendering has become an active research field; it uses neural networks to learn to render from existing observations. Neural rendering is a combination of traditional computer graphics and deep learning. Like traditional computer graphics, it aims to synthesize controllable photo-realistic imagery. Deep learning can be used as a function approximator that converts scene parameters into output imagery; in other words, a deep neural network learns the mapping between controllable scene parameters and the corresponding synthesized imagery. Neural rendering enables implicit or explicit control of scene properties such as pose, appearance, geometry, illumination, semantic structure, and camera parameters. This powerful, data-driven approach can generate controllable imagery for tasks such as relighting, novel view synthesis, scene deformation, and compositing. It must navigate the trade-off between underfitting and overfitting, i.e., between representing the training data well and generalizing to unseen scenes.
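One concrete instance of such a function approximator is a coordinate-based MLP in the spirit of NeRF, which maps a 3D position and viewing direction to a density and an RGB color. The sketch below is a minimal PyTorch illustration; the layer sizes are assumptions, and positional encoding is omitted for brevity, so it is not the exact network used in this work.

```python
import torch
import torch.nn as nn

class RadianceField(nn.Module):
    """Tiny NeRF-style MLP: (position, view direction) -> (density, RGB)."""

    def __init__(self, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(xyz)
        sigma = torch.relu(self.density_head(h))             # non-negative density
        rgb = self.color_head(torch.cat([h, view_dir], -1))  # view-dependent color
        return sigma, rgb

# Toy usage: 1024 sample points with unit-norm view directions
model = RadianceField()
pts = torch.rand(1024, 3)
dirs = torch.nn.functional.normalize(torch.rand(1024, 3), dim=-1)
sigma, rgb = model(pts, dirs)
```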
2 Related work
The generation of photorealistic images and videos has been a main challenge of computer graphics research for many years. Traditional rendering techniques produce new images with physics-based ray tracing; however, classical rendering methods require precise scene geometry and material properties. Scene representations fall into two types: explicit representations (such as volumetric grids and point clouds) and implicit surface functions. They can be produced by human artists, CT scans, depth sensors, and truncated signed distance fields. Neural rendering builds on traditional computer graphics and deep learning to render new photo-realistic images of real-world scenes. Deep learning can be seen as a function approximator that maps scene parameters to the corresponding images. In contrast to traditional rendering techniques, neural rendering can model dynamic scenes and edit objects. Estimating the physical parameters of a real scene is extremely challenging; this process is known as inverse rendering. Neural rendering is capable of meeting the challenges of inverse rendering and achieves impressive performance. Pose estimation uses a neural network to relate the camera parameters (such as focal length, rotation, and translation) to the corresponding images, and these parameters directly affect the rendered content; a sketch of how they can be made learnable is given below.
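A common way to make camera parameters learnable, in the spirit of joint pose-and-scene optimization methods such as NeRF-- and BARF, is to treat the focal length and each image's rotation and translation as trainable variables, converting an axis-angle vector to a rotation matrix differentiably. The sketch below is a minimal PyTorch illustration of that idea; it is not the exact focal net / pose net of this report, and the class and variable names are my own.

```python
import torch
import torch.nn as nn

class LearnableCameras(nn.Module):
    """A trainable shared focal length plus a trainable pose per training image."""

    def __init__(self, num_images, init_focal=500.0):
        super().__init__()
        self.log_focal = nn.Parameter(torch.log(torch.tensor(init_focal)))
        self.axis_angle = nn.Parameter(torch.zeros(num_images, 3))   # so(3) rotation
        self.translation = nn.Parameter(torch.zeros(num_images, 3))  # camera center

    def focal(self):
        return self.log_focal.exp()  # parameterized in log space to stay positive

    def pose(self, idx):
        """Return the (3, 4) camera-to-world matrix [R | t] for image idx."""
        w = self.axis_angle[idx]
        theta = w.norm() + 1e-8
        k = w / theta
        zero = torch.zeros((), dtype=w.dtype)
        # Skew-symmetric matrix built with stack() so gradients flow through k
        K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                         torch.stack([k[2], zero, -k[0]]),
                         torch.stack([-k[1], k[0], zero])])
        # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        R = torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
        return torch.cat([R, self.translation[idx].unsqueeze(-1)], dim=1)

# Toy usage: 48 training images, as in the experiment described in this report
cams = LearnableCameras(num_images=48)
c2w = cams.pose(0)   # (3, 4) pose of the first image
f = cams.focal()     # scalar focal length
```

During training, the focal length and pose produced here would feed the ray generation step, so the photometric rendering loss can update the camera parameters together with the scene representation.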
3 Research Methodology
4 Experiment and Discussion
I have run the classical NeRF model with mobile phone photographs as input, using 48 photos per scene as the training set. A limitation of classical NeRF is its pre-computation step for the camera parameters. To remove this pre-computation stage, I add a focal net and a pose net, both of which are trained together with the NeRF model. After training for about 5,000 epochs, the model reaches a PSNR of 24.875, which is an impressive result: the rendered video is smooth and clear at a relatively high resolution. I plan to add an inversion net to improve the accuracy of pose estimation in the next month. The inversion net consists of a ViT that takes the rendered images as input and outputs an embedding; an MSE loss is then computed between the output of the pose net and the output of the inversion net.
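Based on this description, a minimal sketch of the planned inversion-net consistency loss could look as follows. The use of torchvision's vit_b_16 backbone, the 6-dimensional pose vector, and the resizing of rendered images to 224x224 are my own assumptions about the implementation, not details confirmed in this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16

class InversionNet(nn.Module):
    """Maps a rendered image back to a pose embedding using a ViT backbone."""

    def __init__(self, pose_dim=6):  # e.g. 3 rotation + 3 translation values (assumption)
        super().__init__()
        self.backbone = vit_b_16(weights=None)
        self.backbone.heads = nn.Linear(768, pose_dim)  # replace the classifier head

    def forward(self, images):
        # vit_b_16 expects 224x224 inputs; rendered images are resized if needed
        images = F.interpolate(images, size=(224, 224), mode="bilinear",
                               align_corners=False)
        return self.backbone(images)

def pose_consistency_loss(pose_net_output, rendered_images, inversion_net):
    """MSE between the pose net output and the inversion net's estimate."""
    predicted = inversion_net(rendered_images)  # (B, pose_dim)
    return F.mse_loss(predicted, pose_net_output)

# Toy usage with a batch of two rendered images and matching pose-net outputs
inv_net = InversionNet()
renders = torch.rand(2, 3, 128, 128)
pose_out = torch.rand(2, 6)
loss = pose_consistency_loss(pose_out, renders, inv_net)
```

Whether gradients from this loss should flow into the pose net, the inversion net, or both is a design choice not specified above and remains to be decided.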