Neural Volume Rendering for Novel View Synthesis

Candidate: Zimin Ran (zimin.ran@student.uts.edu.au)
Student ID: 14309835
Supervisor: Linchao Zhu
August 2022
Candidature Assessment Report Stage 1

1 Abstract

The synthesis of photorealistic imagery has been a central challenge of computer graphics research for decades. Typically, images are synthesized with traditional rendering algorithms such as ray tracing, which take as input precisely defined representations of scene geometry and material properties. Traditional scene representations include triangle meshes with associated textures, point clouds, volumetric grids, and implicit surface functions, produced by human modelling, depth sensors, CT scans, and truncated signed distance fields, respectively. Inverse rendering is the process of reconstructing a scene representation through differentiable rendering losses. Neural rendering combines traditional computer graphics with deep learning to render photo-realistic imagery from real-world observations. Its main advantage is that it is designed to be 3D-consistent, enabling novel view synthesis from a captured scene. Beyond methods that handle static real-world scenes, neural rendering is capable of modelling deforming objects, enabling scene editing and compositing. Most approaches are scene-specific; techniques that generalize across object classes can be used for generation tasks.

2 Introduction

Neural volume rendering is an active research field that can generate photo-realistic imagery. Neural networks achieve impressive results in estimating the density and color of scene objects. One of the sub-problems in neural volume rendering is camera pose estimation. Solving it benefits many industrial applications: arbitrary images, without any known focal length or intrinsic and extrinsic camera parameters, can serve as input for rendering imagery from unseen viewpoints. My research focuses on camera pose estimation in neural rendering, trained on casually captured mobile phone photos. I outline my research and present results in the following sections.

Research questions:
1. How can camera poses be estimated by a neural network?
2. How can an image be rendered and its camera pose estimated at the same time?
3. How can camera poses be estimated from mobile phone photographs without any rotation, translation, or focal length information?

Research objectives (see the projection model sketched after this list):
1. Design a neural network that estimates the focal length and camera poses from phone photos without any ground-truth camera pose information.
2. Propose a model that estimates the camera pose directly from an image.
3. Build a composite model that estimates camera poses and renders photo-realistic images from unseen viewpoints simultaneously.
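As a concrete reference for these objectives, the camera parameters to be estimated (focal length f, rotation R, and translation t) enter rendering through the standard pinhole projection model; this equation is standard background added here for clarity, not a claim from the proposal itself:

$$
\mathbf{x} \sim K\left(R\,\mathbf{X} + \mathbf{t}\right), \qquad
K = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix},
$$

where X is a 3D point, x its pixel coordinates up to perspective division, and (c_x, c_y) the principal point. The focal net introduced below estimates f, while the pose net estimates (R, t).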
Research contributions:
1. Design a focal net and a pose net to estimate camera pose information.
2. Remove the pre-computation of camera parameters required by classical NeRF.
3. Estimate the camera pose directly from an image with an inversion net.

Neural Volume Rendering for Novel View Synthesis

1 Introduction

The generation of controllable photo-realistic images and videos is a long-standing computer graphics challenge. The metaverse is a popular concept closely related to virtual reality, and the development of virtual and augmented reality relies on computer graphics, which therefore has huge potential in the virtual reality market. Traditionally, the techniques for synthesizing photo-real imagery are based on the laws of physics: they take physical parameters as input and simulate light transport to render photo-realistic imagery. Ray tracing requires all the physical properties to be provided, including material properties (e.g., reflectivity and opacity), scene geometry, camera intrinsics (e.g., focal length), camera extrinsics (i.e., rotation and translation), and illumination. Besides physics-based approaches, mathematically based methods need fewer parameters in training. Applying these classical approaches to imagery of real-world scenes requires estimating the physical parameters from images, an extremely challenging task known as inverse rendering.

Recently, neural rendering has become an active research field that uses neural networks to learn to render from existing observations. Neural rendering is a combination of traditional computer graphics and deep learning. Like traditional computer graphics, neural rendering aims to synthesize controllable photo-realistic imagery. Deep learning can be used as a function approximator that converts scene parameters into output imagery; in other words, a deep neural network learns the mapping between controllable scene parameters and the corresponding synthesized imagery. Neural rendering enables implicit or explicit control of scene properties such as pose, appearance, geometry, illumination, semantic structure, and camera parameters. This powerful, data-driven method can generate controllable imagery for tasks such as relighting, novel view synthesis, scene deformation, and compositing. It must navigate the trade-off between underfitting and overfitting, i.e., between representing the training dataset well and generalizing to unseen scenes.

2 Related Work

The generation of photorealistic images and videos has been a main challenge of computer graphics research in recent years. Traditional rendering techniques are physics-based, such as ray tracing. However, classical rendering methods require precise scene geometry and material properties. Scene representations fall into two types: explicit representations (such as volumetric grids and point clouds) and implicit surface functions; they can be produced by human modelling, CT scans, depth sensors, and truncated signed distance fields. Neural rendering builds on traditional computer graphics and deep learning to render new photo-realistic images of real-world scenes. Deep learning can be seen as a function approximator that maps scene parameters to the corresponding images. In contrast to traditional rendering techniques, neural rendering is able to model dynamic scenes and edit objects.
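As standard background for the NeRF-style methods discussed here, classical NeRF represents a scene as a density field sigma and a view-dependent colour field c, and renders the colour of a camera ray r(t) = o + t d by volume integration (Mildenhall et al.'s formulation):

$$
\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right),
$$

where T(t) is the accumulated transmittance between the near bound t_n and t. Because the ray origin o and direction d depend on the camera pose and focal length, the photometric loss is differentiable with respect to those parameters, which is what makes the joint pose estimation described below possible.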
Estimating the physical parameters of a real scene is extremely challenging; this process is known as inverse rendering. Neural rendering is capable of meeting the challenges of inverse rendering and achieves impressive performance. Pose estimation uses a neural network to map the camera pose parameters (focal length, rotation, and translation) to the corresponding images; these parameters directly affect the rendered content.

3 Experiment and Discussion

I have run the classical NeRF model with mobile phone photographs as input, using 48 photos per scene as the training dataset. A limitation of classical NeRF is its pre-computation step for camera parameters. To remove this pre-computation stage, I add a focal net and a pose net, both of which are trained jointly with the NeRF model. After training for about 5,000 epochs, the model reaches a PSNR of 24.875 dB, a promising result given that no pre-computed camera parameters are used. The rendered video is smooth and clear at a relatively high resolution. Next month, I plan to add an inversion net to improve the accuracy of pose estimation. The inversion net consists of a ViT that takes the rendered images as input and outputs an embedding; the MSE between the output of the pose net and the output of the inversion net is then computed as a consistency loss. A sketch of this planned pipeline follows.
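To make the planned pipeline concrete, the sketch below shows one way the focal net, pose net, and inversion-net consistency loss could fit together. It is a minimal illustration under stated assumptions, not the actual implementation: all module and function names are hypothetical, the renderer is stubbed out, and a small CNN stands in for the planned ViT backbone.

```python
# Illustrative sketch of the planned pipeline. All names (FocalNet, PoseNet,
# InversionNet, render_image) are hypothetical placeholders; render_image is
# a stub for a NeRF renderer that is differentiable w.r.t. pose and focal.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalNet(nn.Module):
    """A single learnable focal length (in pixels) shared by all images."""
    def __init__(self, init_focal=500.0):
        super().__init__()
        self.log_focal = nn.Parameter(torch.tensor(init_focal).log())

    def forward(self):
        return self.log_focal.exp()  # exp keeps the focal length positive


class PoseNet(nn.Module):
    """One learnable 6-DoF pose per training image:
    3 axis-angle rotation parameters plus 3 translation parameters."""
    def __init__(self, num_images):
        super().__init__()
        self.poses = nn.Parameter(torch.zeros(num_images, 6))

    def forward(self, idx):
        return self.poses[idx]  # shape (B, 6)


class InversionNet(nn.Module):
    """Maps a rendered image back to a 6-DoF pose estimate. A small CNN
    encoder stands in here for the ViT backbone planned in the report."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 6)

    def forward(self, img):
        return self.head(self.encoder(img))  # embedding -> pose, shape (B, 6)


def render_image(pose, focal, height, width):
    """Stub renderer: a real NeRF renderer would cast rays from the given
    pose/focal and integrate density and colour along each ray. The zero
    terms merely keep pose and focal in the autograd graph for this stub."""
    return torch.zeros(pose.shape[0], 3, height, width) + 0.0 * focal + 0.0 * pose.sum()


def training_step(focal_net, pose_net, inv_net, images, idx, lam=0.1):
    """One joint step: a photometric loss on the rendering plus an MSE
    consistency loss between pose net and inversion net outputs."""
    focal = focal_net()
    pose = pose_net(idx)
    rendered = render_image(pose, focal, images.shape[-2], images.shape[-1])
    photometric = F.mse_loss(rendered, images)
    # Detach the pose target so this term trains the inversion net against
    # the current pose estimates (one possible design choice, not the only one).
    consistency = F.mse_loss(inv_net(rendered), pose.detach())
    return photometric + lam * consistency
```

Under this sketch, lam weights the consistency term; whether the consistency gradient should also flow back into the pose net (i.e., dropping the detach) is an open design choice for the planned work.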