1 Introduction

3D imaging is also widely known as stereoscopy. This technique creates or enhances the illusion of depth in a 2D image by exploiting binocular vision. Almost all stereoscopic methods are based on two images, one from the left view and one from the right. These two images are then combined to give the illusion of a 3D view that includes depth. 3D television is now a major milestone of visual media. In recent years, researchers have focused on developing algorithms for acquiring images and converting them into 3D models using depth analysis.

The third dimension can usually be perceived only through human vision: the eyes perceive depth, and the brain reconstructs the third dimension from the different views seen by each eye. Researchers use the same strategy to reconstruct a 3D model from multiple views with the help of disparity and calibration parameters. Today there are special cameras that can capture a 3D model of the scene directly; examples include stereoscopic dual cameras and depth-range cameras. These cameras usually capture the RGB component of the image together with its corresponding depth map. A depth map is a function that gives the depth of the scene at each point (i.e., at each pixel position); usually the pixel intensity is taken to be the depth (a minimal sketch of this idea is given at the end of Section 2).

3D reconstruction is one of the most complex problems tackled by deep learning systems. It has been studied extensively across computer vision, computer graphics and machine learning, with limited success until convolutional neural networks (CNNs) entered the field and yielded promising results.

2 Background Study

Recovering the dimension lost during image acquisition with an ordinary camera has been a hot research area in computer vision for more than a decade. The literature shows that the research methodology has changed over time. More precisely, we can divide the conversion of 2D images into 3D models into three generations.

The first generation learns the 3D-to-2D image projection process by exploiting mathematical and geometrical information through an analytical or algorithmic solution. These solutions usually require multiple images captured with specially calibrated cameras. For example, given multiple views of an object taken at a constant angular step covering all 360 degrees, we can compute the geometrical points of the object and, using triangulation techniques, join these points into a 3D model (the second sketch at the end of this chapter illustrates the triangulation step).

The second generation of 2D-to-3D conversion uses accurately segmented 2D silhouettes. This generation produces reasonable 3D models, but it requires specially designed, calibrated cameras to capture the object from every angle, which makes the technique impractical because of the complex image-capturing setup.

Humans can infer the shape of an object from prior knowledge and predict what it will look like from an unseen viewpoint. Computer-vision techniques inspired by human vision follow the same idea to convert 2D images into 3D models. With the availability of large-scale data sets, deep learning research has evolved towards 3D reconstruction from a single 2D image.
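To make the depth-map definition from the introduction concrete, the following minimal NumPy sketch back-projects a depth map into a point cloud. It assumes a simple pinhole camera with known, hypothetical intrinsics (fx, fy, cx, cy) and depth stored directly as pixel intensity; real depth sensors use different conventions.

```python
import numpy as np

def depth_map_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud.

    depth : (H, W) array where each pixel intensity is the depth
            (hypothetical convention; real sensors differ).
    fx, fy, cx, cy : pinhole-camera intrinsics (assumed known).
    Returns an (N, 3) array of 3D points in the camera frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # inverse of the pinhole model u = fx * x / z + cx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep pixels with valid (positive) depth

# Example: a synthetic 4x4 depth map at a constant 2.0 units of depth.
cloud = depth_map_to_point_cloud(np.full((4, 4), 2.0), fx=500, fy=500, cx=2, cy=2)
print(cloud.shape)  # (16, 3)
```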
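The second sketch illustrates the first-generation geometric pipeline described above: triangulating a 3D point from two calibrated views. This is a standard linear (DLT) triangulation, written here from scratch as an assumption-laden illustration; the projection matrices P1 and P2 are taken to be known from camera calibration.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two calibrated views.

    P1, P2 : (3, 4) camera projection matrices (known from calibration).
    x1, x2 : (u, v) pixel coordinates of the same point in each view.
    Returns the triangulated 3D point (X, Y, Z).
    """
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The 3D point is the null vector of A, found via SVD.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Example: two views of the point (0, 0, 5); the second camera is shifted in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
print(triangulate_point(P1, P2, (0.0, 0.0), (-0.2, 0.0)))  # ~ [0, 0, 5]
```

Joining many such triangulated points into a mesh is what the triangulation techniques mentioned above then accomplish.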
A deep-belief-network-based 3D model was proposed to learn the 3D model from a single 2D image; it is considered one of the earlier neural-network-based, data-driven models for reproducing a 3D model.

3 Analysis

3.1 Geometry-Based Reconstruction

3D reconstruction using geometry-based methods requires complex geometrical information, and most of these methods are scene-dependent. A method for 3D human reconstruction was proposed based on geometric information. Other methods focused on improving the quality of 3D sensory inputs, such as multiview cameras and 3D scanners, and then converting these data into a 3D model. However, all of these methods require more than one view of an object to capture sufficient geometry for 3D reconstruction. For 3D reconstruction from a single 2D image, it is difficult to extract geometrical information, which makes it difficult to formulate a 3D model. Moreover, we need to preserve the depth information of the scene or object to reconstruct the model in 3D.

Scene reconstruction and modelling are two major tasks of 2D and 3D computer vision. Reconstruction gives us an exact observation of the 2-dimensional and 3-dimensional world, whereas modelling allows us to perceive it accurately. Both tasks have always been active areas of research because of their wide range of potential applications, such as scene representation, scene understanding and robot navigation. For a moving 2D-3D camera setup, a 3D reconstruction of the scene can be obtained by registering a sequence of point clouds with the help of Visual Odometry (VO) measurements. However, VO-based registration is valid only for the static parts of the scene, so such reconstructions suffer from visual artifacts caused by the dynamic parts. In this regard, recent work by Jiang et al. [4–6] categorizes the scene into static and dynamic parts before performing VO. Their method focuses on improving the VO measurements, and the attempted dynamic-object reconstruction is rather preliminary and naive. In this work, we focus on high-quality reconstruction of the dynamic objects themselves (Fig. 1).

Fig. 1: Moving car reconstruction from a mobile platform. Top: selected frames of a moving car. Middle: the registered sparse point cloud, the smoothed point cloud, and the reconstructed mesh of the point cloud, respectively. Bottom: the fine reconstruction from different views.

3.2 Learning-Based Reconstruction

Learning-based reconstruction approaches use data-driven volumetric 3D model synthesis. The research community has leveraged improvements in deep learning to enable efficient modelling of a 2D image into a 3D model. With the availability of large-scale data sets such as ShapeNet, most researchers focus on developing a 3D voxelized model from a single 2D image. Recently, various approaches have been proposed for this task. One study shows that a 3D morphable shape can be generated from an image of a human face, but it requires many manual interactions and high-quality 3D scanning of the face. Some methods suggest learning a 3D shape model from key points or silhouettes. In other studies, the depth map of the single image is first estimated using machine-learning techniques, and then a 3D model is constructed from the resulting RGB-D image. Convolutional neural networks (CNNs) have recently become popular for predicting the geometry directly from a single image using an encoder–decoder architecture, sketched minimally below.
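As a deliberately minimal illustration of such an encoder–decoder, the following PyTorch sketch maps a single RGB image to a 32x32x32 voxel occupancy grid. The layer sizes, latent dimension and resolutions are illustrative assumptions, not taken from any specific published model.

```python
import torch
import torch.nn as nn

class Image2Voxel(nn.Module):
    """Minimal encoder-decoder: one RGB image in, 32^3 occupancy grid out.
    All sizes are illustrative, not from any specific published model."""

    def __init__(self):
        super().__init__()
        # Encoder: 64x64 RGB image -> flat latent feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256),                            # latent code
        )
        # Decoder: latent code -> 32^3 voxel occupancy probabilities.
        self.fc = nn.Linear(256, 64 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),              # 16 -> 32
            nn.Sigmoid(),
        )

    def forward(self, image):
        z = self.encoder(image)
        vol = self.fc(z).view(-1, 64, 4, 4, 4)
        return self.decoder(vol)

voxels = Image2Voxel()(torch.randn(1, 3, 64, 64))
print(voxels.shape)  # torch.Size([1, 1, 32, 32, 32])
```

A network of this shape would typically be trained with a voxel-wise binary cross-entropy loss against ground-truth occupancy grids.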
In such an architecture, the encoder extracts features from the single image, while the decoder generates the model from the features extracted by the encoder. In another study, deep-CNN-based models were learned in which a single input image is directly mapped to an output 3D representation, generating the 3D model in a single step. The authors of another study proposed a 3D recurrent-reconstruction-neural-network (RRNN) technique, in which the 3D model is generated in steps from a 2D image input. Some studies used a 2D image together with depth information as input to a 3D U-Net architecture. For 3D appearance rendering, Groueix et al. used a convolutional encoder–decoder architecture to generate a 3D scene from a single input image. Haoqiang et al. then further improved the quality of the generated 3D scene by incorporating a differentiable appearance-sampling mechanism.

3.3 Object Reconstruction

In order to use the captured hand motion for 3D reconstruction, we have to infer the contact points with the object; this is described in Section 3.3.1. The reconstruction process based on the estimated hand poses and the inferred contact points is then described.

3.3.1 Contact Points Computation

To compute the contact points, we use the high-resolution mesh of the hand that was used for hand motion capture. To this end, for each vertex associated with each end-effector, we compute the distance to the closest point of the object point cloud D_o. We first count, for each end-effector, the number of vertices whose closest distance is less than 1 mm. If an end-effector has more than 40 candidate contact vertices, it is labelled as a contact bone and all vertices of the bone are labelled as contact vertices. If fewer than 2 end-effectors are selected, we iteratively increase the distance threshold by 0.5 mm until we have at least two end-effectors. In our experiments, we observed that the threshold barely exceeds 2.5 mm. As a result, we obtain for each frame pair the set of contact correspondences (X_hand, X′_hand) ∈ C_hand(θ, D_h), where (X_hand, X′_hand) is a pair of contact vertices in the source and target frames, respectively.

3.4 The Main Objective of 3D Object Reconstruction

This deep learning technology aims to infer the shape of 3D objects from 2D images. To conduct the experiment, you need the following: highly calibrated cameras that photograph the object from various angles, and large training data sets from which the geometry of the object to be reconstructed can be predicted. These data sets can be collected from an image database, or they can be collected and sampled from a video. With this apparatus and these data sets, you can proceed with 3D reconstruction from 2D data.

3.5 State-of-the-Art Technology Used by the Data Sets for the Reconstruction of 3D Objects

The technology used for this purpose needs to satisfy the following parameters:

Input. Training is done with one or multiple RGB images, for which segmented 3D ground truth is required. The input can be a single image, multiple images or even a video stream. Testing is done with the same kinds of input, which may show the object against a uniform background, a cluttered background, or both.

Output. The volumetric output is produced in both high and low resolution, and the surface output is generated through parameterisation, template deformation or a point cloud.
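To relate the volumetric and point-cloud output families, here is a minimal NumPy sketch (an illustration under simple assumptions, not a method from the studies cited above) that turns a voxel occupancy grid, such as the one produced by the encoder–decoder sketch in Section 3.2, into a point cloud by thresholding and keeping the occupied voxel centres.

```python
import numpy as np

def voxels_to_point_cloud(occupancy, threshold=0.5, grid_size=1.0):
    """Convert a volumetric network output into a point cloud.

    occupancy : (D, D, D) array of occupancy probabilities in [0, 1]
                (e.g. the sigmoid output of a voxel decoder).
    threshold : probability above which a voxel counts as occupied
                (0.5 is a common but arbitrary choice).
    grid_size : edge length of the cube that the grid spans.
    Returns an (N, 3) array of occupied-voxel centres.
    """
    d = occupancy.shape[0]
    idx = np.argwhere(occupancy > threshold)   # integer voxel indices
    # Shift to voxel centres and scale into [0, grid_size]^3 coordinates.
    return (idx + 0.5) * (grid_size / d)

# Example: a 32^3 grid with a single occupied voxel at index (16, 16, 16).
grid = np.zeros((32, 32, 32))
grid[16, 16, 16] = 0.9
print(voxels_to_point_cloud(grid))  # [[0.515625 0.515625 0.515625]]
```

Surface-based methods go further, fitting a mesh through such points rather than reporting the raw voxel centres.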
Both direct and intermediate outputs are produced in these volumetric and surface forms.

Network architecture used. The architecture used in training is 3D-VAE-GAN, which has an encoder and a decoder, together with TL-Net and a conditional GAN. The testing architecture is 3D-VAE, which has an encoder and a decoder.

Training used. The degree of supervision (2D versus 3D supervision, or weak supervision) together with the loss functions must be specified for this system. The training procedure is adversarial training with joint 2D and 3D embeddings. The network architecture is also extremely important for the speed and quality of the output images.

Practical applications and use cases. Reconstruction can be done with volumetric representations or surface representations, and powerful computer systems are needed for it. Some of the places where 3D object reconstruction deep learning systems are used:

- In police departments, for reconstructing the faces of criminals from images procured at a crime scene in which the face is not completely visible.
- For re-modelling ruins at ancient architectural sites: the rubble or debris stubs of structures can be used to recreate the entire building and give an idea of how it looked in the past.
- In plastic surgery, where organs, the face, limbs or any other part of the body has been damaged and needs to be rebuilt.
- In airport security, where concealed shapes can be used to judge whether a person is armed or carrying explosives.
- In completing DNA sequences.

3.6 3D Models

Solid model. Solid models deliver a 3D digital representation of an object with all the proper geometry. As with the other types, the geometry is correct, but "solid" refers to the model as a whole instead of only its surface: the object cannot be hollow. Much like the other types, solid models are built from three-dimensional shapes, and you can use a myriad of basic and complex shapes. Those shapes function like building blocks that combine into a single object: you can add more material to the blocks or subtract from them. Some CAD programs use modifiers, starting with one big chunk of solid that is methodically carved out, as if you were physically milling the base material in a workshop (a minimal signed-distance sketch of this add/subtract principle is given at the end of this chapter).

Wireframe modelling. When the object features many complex curves, wireframe modelling is often the method of choice. The basic shapes that serve as the building blocks of solid models are sometimes too difficult to modify into the desired configuration and dimensions. Wireframe modelling allows for a smoother transition between curved edges in intricate objects. As the complexity increases, however, some drawbacks become more apparent.

Surface modelling. A step up in detail is the surface model. When seamless integration between edges and a smooth transition from one vertex to the next are required, you need higher computational power to run the software for building a surface model. Compared with the previous two, surface modelling is more demanding, but only because it can create just about every shape that would be too difficult to attain with the solid or wireframe methods.

3.7 2D Models

2D reconstruction uses two kinds of 2D models: geometric 2D models and architectural 2D models.
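The sketch below makes the add/subtract principle of solid modelling (Section 3.6) concrete, using signed distance functions, a common formulation in which boolean union and subtraction become simple min and max operations. The shapes, sizes and sampling scheme here are arbitrary illustrative assumptions.

```python
import numpy as np

def sphere_sdf(p, centre, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - centre, axis=-1) - radius

def box_sdf(p, centre, half_size):
    """Signed distance to an axis-aligned box."""
    q = np.abs(p - centre) - half_size
    return np.linalg.norm(np.maximum(q, 0), axis=-1) + np.minimum(q.max(axis=-1), 0)

# CSG on signed distance fields: union = min, subtraction = max(a, -b).
def union(a, b):
    return np.minimum(a, b)

def subtract(a, b):
    return np.maximum(a, -b)

# Solid model: a box with a spherical cavity carved out of one corner,
# i.e. "subtracting material" exactly as described in Section 3.6.
pts = np.random.uniform(-1, 1, size=(100000, 3))  # random test points
d_box = box_sdf(pts, centre=np.zeros(3), half_size=np.full(3, 0.5))
d_sph = sphere_sdf(pts, centre=np.full(3, 0.5), radius=0.4)
d_solid = subtract(d_box, d_sph)
inside = pts[d_solid < 0]                         # points inside the final solid
print(f"{len(inside)} of {len(pts)} sample points lie inside the carved solid")
```

Because the model is defined by where the distance field is negative, the solid can never be hollow in the surface-only sense, which is the defining property of solid models.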
4 Summary, Conclusion and Recommendation

4.1 Summary

This study was carried out to examine 2D/3D object reconstruction. 3D object reconstruction is the process of capturing the shape and appearance of real objects. 2D reconstruction is used, for example, to recreate a face from a skull with the use of soft-tissue depth estimates.

4.2 Conclusion

In conclusion, 2D/3D object reconstruction allows us to gain insight into qualitative features of an object that cannot be deduced from a single plane of sight, such as its volume and its position relative to other objects in the scene. Most traditional face reconstructions require a special setup, expensive hardware, predefined conditions and/or manual labour, which make them impractical for general applications. Although recent approaches have overcome some of these setbacks, quality and speed are still not up to the expected levels. More realistic 3D character-modelling software could be used to reconstruct the final 3D face, or the default 3D model could be created from such software.

4.3 Recommendation

i. Knowledge of 2D/3D object reconstruction is important and should therefore be taught in schools at higher levels.
ii. It should be made compulsory at all levels of higher institutions.

References

Häne, C.; Tulsiani, S.; Malik, J. Hierarchical surface prediction for 3D object reconstruction. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 412–420.
Kar, A.; Tulsiani, S.; Carreira, J.; Malik, J. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1966–1974.
Lu, Y.; Wang, Y.; Lu, G. Single image shape-from-silhouettes. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3604–3613.
Toshev, A.; Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
Zhang, C.; Pujades, S.; Black, M.; Pons-Moll, G. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4191–4200.
https://github.com/natowi/3D-Reconstruction-with-Deep-Learning-Methods
https://iaeme.com/MasterAdmin/Journal_uploads/IJCIET/VOLUME_8_ISSUE_12/IJCIET_
https://tongtianta.site/paper/68922