
2D/3D RECONSTRUCTION

Introduction
3D imaging is also widely known as stereoscopy. The technique creates or enhances the illusion of depth in a 2D image by exploiting binocular vision. Almost all stereoscopic methods are based on two images, one from the left view and one from the right view. These two images are then combined to give the illusion of a 3D view that includes depth. 3D television is now a major milestone of visual media. In recent years, researchers have focused on developing algorithms for acquiring images and converting them to 3D models using depth analysis. The third dimension can usually be perceived only by human vision.
The eyes perceive depth, and the brain reconstructs the third dimension from the different views seen by each eye. Researchers use the same strategy to reconstruct a 3D model from multiple views with the help of disparity and calibration parameters. Special cameras are now available that capture a 3D model of the view directly; examples include stereoscopic dual cameras and depth-range cameras. These cameras usually capture the RGB component of the image together with its corresponding depth map. The depth map is a function that gives the depth of an object at each point (i.e., at each pixel position), and the pixel intensity is usually taken as the depth, as sketched below.
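To make the depth-map definition concrete, the following is a minimal Python sketch that back-projects a depth map into a 3D point cloud. It assumes a simple pinhole camera; the intrinsic values fx, fy, cx, cy are hypothetical placeholders, not parameters of any specific camera.

```python
import numpy as np

# Assumed pinhole intrinsics (illustrative values only).
fx = fy = 525.0          # focal length in pixels
cx, cy = 319.5, 239.5    # principal point

def depth_map_to_point_cloud(depth):
    """Back-project an (H, W) depth map to an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx    # perspective back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# For a calibrated stereo pair, depth itself comes from disparity d:
#     z = fx * baseline / d
```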
3D reconstruction is one of the most complex problems tackled by deep learning systems. There has been a great deal of research in this field across computer vision, computer graphics, and machine learning, with only limited success from classical approaches. More recently, however, convolutional neural networks (CNNs) have entered the field and yielded promising results.
Background Study
Recovering the dimension lost during image acquisition with an ordinary camera has been a hot research area in computer vision for more than a decade. The literature shows that the research methodology has evolved over time. More precisely, we can divide the conversion of 2D images into 3D model reconstruction into three generations. The first generation learns the 3D-to-2D image projection process by exploiting mathematical and geometrical information through an analytical or algorithmic solution. These solutions usually require multiple images captured with specially calibrated cameras. For example, using multiple views of an object taken at constant angular increments covering all 360 degrees, we can compute the geometrical points of the object and, using triangulation techniques, join these points into a 3D model (a minimal triangulation sketch follows below). The second generation of 2D-to-3D conversion relies on accurately segmented 2D silhouettes. This generation produces reasonable 3D models, but it still requires specially calibrated cameras that capture the object from every angle. Such techniques are not practical because of the complexity of the capture setup.
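As an illustration of the first-generation geometric approach, the sketch below implements linear (DLT) triangulation with NumPy: given the 3x4 projection matrices of two calibrated views and a matching pixel in each image, it recovers the 3D point. The matrices P1 and P2 are assumed to be known from camera calibration.

```python
import numpy as np

def triangulate(P1, P2, pt1, pt2):
    """Linear (DLT) triangulation of one 3D point from two calibrated views."""
    u1, v1 = pt1
    u2, v2 = pt2
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

Repeating this for every matched point across the views yields the geometrical point set that the text describes joining into a 3D model.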
Humans can infer the shape of an object using prior knowledge and predict what it will look like from an unseen viewpoint. Computer-vision-based techniques for converting 2D images into 3D models are inspired by this ability of human vision. With the availability of large-scale datasets, deep learning research has advanced 3D reconstruction from a single 2D image. A deep-belief-network-based model was proposed to learn a 3D model from a single 2D image; it is considered one of the earliest neural-network-based, data-driven models for reproducing a 3D model.
Analysis
3.1 Geometry-Based Reconstruction
3D reconstruction using geometry-based methods requires complex geometrical information, and most of these methods are scene-dependent. A method for 3D human reconstruction was proposed based on geometric information. Other methods focused on improving the quality of 3D sensory inputs, such as multiview cameras and 3D scanners, and then converting these data into a 3D model. However, all of these methods require more than one view of an object to capture sufficient geometry for 3D reconstruction. When reconstructing from a single 2D image, it is difficult to extract geometrical information, and therefore difficult to formulate a 3D model. Moreover, the depth information of the scene or object must be preserved to reconstruct the model in 3D.
Scene reconstruction and modelling are two major tasks of 2D and 3D computer vision. Reconstruction offers an exact observation of the 2-dimensional and 3-dimensional world, whereas modelling allows us to perceive it accurately. Both tasks have always been active areas of research due to their wide range of potential applications, such as scene representation, scene understanding, and robot navigation.
For a moving 2D-3D camera setup, a 3D reconstruction of the scene can be obtained by registering a sequence of point clouds with the help of Visual Odometry (VO) measurements (a minimal registration sketch follows below). However, VO-based registration is valid only for the static parts of the scene, so such reconstructions suffer from visual artifacts caused by the dynamic parts. In this regard, recent work by Jiang et al. [4–6] categorizes the scene into static and dynamic parts before performing VO. Their method focuses on improving the VO measurements, and the attempted dynamic object reconstruction is rather preliminary and naive. In this work, we focus on high-quality reconstruction of the dynamic objects.
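The following is a minimal sketch of VO-based registration, assuming each frame's point cloud comes with a 4x4 camera-to-world pose estimated by visual odometry. As noted above, this is valid only for the static parts of the scene.

```python
import numpy as np

def register_clouds(clouds, poses):
    """Register per-frame point clouds into one world-frame cloud.

    clouds: list of (N_i, 3) arrays in each camera's frame
    poses:  list of 4x4 camera-to-world transforms from VO
    """
    world_points = []
    for pts, T in zip(clouds, poses):
        homog = np.hstack([pts, np.ones((len(pts), 1))])  # (N, 4) homogeneous
        world_points.append((homog @ T.T)[:, :3])         # map into world frame
    return np.vstack(world_points)  # registered scene point cloud
```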
Fig. 1: Moving car reconstruction from a mobile platform. Top: selected frames of a moving car. Middle: the registered sparse point cloud, the smoothed point cloud, and the reconstructed mesh of the point cloud, respectively. Bottom: the fine reconstruction from different views.
3.2 Learning-Based Reconstruction
Learning-based reconstruction approaches use data-driven volumetric 3D model synthesis. The research community has leveraged advances in deep learning to model a 2D image into a 3D model efficiently. With the availability of large-scale datasets such as ShapeNet, most researchers focus on producing a voxelized 3D model from a single 2D image, and various approaches have recently been proposed for this task. One study shows that a 3D morphable shape can be generated from an image of a human face, but it requires considerable manual interaction and a high-quality 3D scan of the face. Some methods learn a 3D shape model from keypoints or silhouettes. In other studies, the depth map of the single image is first estimated using machine-learning techniques, and a 3D model is then constructed from the resulting RGB-D images.
Convolutional neural networks (CNNs) have recently become popular for predicting geometry directly from a single image using an encoder-decoder architecture. The encoder extracts features from the single image, while the decoder generates the model from the features extracted by the encoder (a minimal sketch follows below). In one study, deep-CNN-based models were trained to map a single input image directly to the output 3D representation, generating the 3D model in a single step. The authors of another study proposed a 3D recurrent-reconstruction-neural-network (RRNN)-based technique, in which the 3D model is generated in steps from a 2D image input. Other studies used a 2D image together with depth information as input to a 3D U-Net architecture. For 3D appearance rendering, Groueix et al. used a convolutional encoder-decoder architecture to generate a 3D scene from a single input image. Haoqiang et al. then further improved the quality of the generated 3D scene by incorporating a differentiable appearance sampling mechanism.
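A minimal PyTorch sketch of such an encoder-decoder: a 2D convolutional encoder compresses a single RGB image into a latent vector, and a 3D transposed-convolution decoder expands it into a 32^3 voxel occupancy grid. The layer sizes and latent dimension are illustrative choices, not those of any particular published model.

```python
import torch
import torch.nn as nn

class Image2Voxel(nn.Module):
    """Minimal encoder-decoder: 128x128 RGB image -> 32^3 voxel occupancy grid."""
    def __init__(self, latent_dim=256):
        super().__init__()
        # Encoder: compress the 2D image into a latent feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )
        # Decoder: upsample the latent vector into a 3D volume.
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),               # 16 -> 32
        )

    def forward(self, img):
        z = self.encoder(img)
        vol = self.fc(z).view(-1, 128, 4, 4, 4)
        return torch.sigmoid(self.decoder(vol))  # per-voxel occupancy probability

model = Image2Voxel()
voxels = model(torch.randn(1, 3, 128, 128))  # -> shape (1, 1, 32, 32, 32)
```

Such a model would typically be trained with a per-voxel binary cross-entropy loss against ground-truth voxelizations from a dataset like ShapeNet.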
3.3 Object Reconstruction
To use the captured hand motion for 3D reconstruction, we must infer the contact points with the object; this is described in Section 3.3.1. The reconstruction process based on the estimated hand poses and the inferred contact points is then described.
3.3.1 Contact Points Computation
To compute the contact points, we use the high-resolution mesh of the hand that was used for hand motion capture. For each vertex associated with each end-effector, we compute the distance to the closest point of the object point cloud D_o. We first count, for each end-effector, the number of vertices whose closest distance is less than 1 mm. If an end-effector has more than 40 candidate contact vertices, it is labeled as a contact bone and all vertices of the bone are labeled as contact vertices. If fewer than 2 end-effectors are selected, we iteratively increase the distance threshold by 0.5 mm until at least two end-effectors qualify; in our experiments, the threshold rarely exceeded 2.5 mm. As a result, we obtain for each frame pair the set of contact correspondences (X_hand, X'_hand) ∈ C_hand(θ, D_h), where (X_hand, X'_hand) is a pair of contact vertices in the source and target frames, respectively. A sketch of this procedure follows below.
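The following is a sketch of the contact-point computation described above, using SciPy's cKDTree for nearest-neighbour queries into the object point cloud. Units assume coordinates in metres (so 1 mm = 1e-3), and the dictionary layout of end-effector vertices is a hypothetical representation.

```python
import numpy as np
from scipy.spatial import cKDTree

def find_contact_bones(end_effector_verts, object_points,
                       init_thresh=1e-3, step=5e-4, min_verts=40, min_bones=2):
    """Label end-effectors in contact with the object.

    end_effector_verts: dict mapping bone name -> (N_i, 3) vertex array
    object_points:      (M, 3) object point cloud (D_o)
    """
    tree = cKDTree(object_points)  # nearest-neighbour queries into D_o
    thresh = init_thresh
    while True:
        contacts = {}
        for bone, verts in end_effector_verts.items():
            dists, _ = tree.query(verts)       # closest object point per vertex
            candidates = (dists < thresh).sum()
            if candidates > min_verts:         # "more than 40 candidate vertices"
                contacts[bone] = verts         # whole bone labelled as contact
        if len(contacts) >= min_bones:
            return contacts, thresh
        thresh += step  # relax threshold in 0.5 mm steps (rarely exceeds 2.5 mm)
```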
3.4 The Main Objective of 3D Object Reconstruction
This deep learning technology aims to infer the shape of 3D objects from 2D images. To conduct the experiment, you need the following:
• Highly calibrated cameras that photograph the object from various angles.
• Large training datasets from which the geometry of the object to be reconstructed can be predicted. These datasets can be collected from a database of images, or collected and sampled from a video (a frame-sampling sketch follows below).
With this apparatus and these datasets, you can proceed with 3D reconstruction from 2D data.
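Where the dataset is sampled from a video, a simple OpenCV loop such as the following can extract every n-th frame; the sampling stride is an arbitrary choice.

```python
import cv2

def sample_frames(video_path, every_n=10):
    """Sample every n-th frame from a video to build a training image set."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:          # end of stream
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```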
3.5 State-of-the-Art Technology Used by the Datasets for the Reconstruction of 3D Objects
The technology used for this purpose needs to satisfy the following parameters:
• Input
Training uses one or more RGB images for which the 3D ground truth has been segmented; the input can be a single image, multiple images, or even a video stream.
Testing uses the same kind of input, and test images may have a uniform background, a cluttered background, or both.
• Output
The volumetric output is produced at both high and low resolution, and the surface output is generated through parameterisation, template deformation, and point clouds. Both direct and intermediate outputs are computed this way.
• Network architecture used
The training architecture is 3D-VAE-GAN, which has an encoder and a decoder, combined with TL-Net and a conditional GAN, while the testing architecture is 3D-VAE, which also has an encoder and a decoder.
• Training used
The system must specify the degree of supervision (2D versus 3D supervision, or weak supervision) along with the loss functions. The training procedure is adversarial training with joint 2D and 3D embeddings; a minimal adversarial step is sketched below. The network architecture is also extremely important for the speed and processing quality of the output images.
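A minimal sketch of one adversarial training step in this spirit, assuming a generator G that maps images to voxel grids (such as the encoder-decoder sketched in Section 3.2) and a discriminator D that scores a voxel grid with a single logit. This is an illustrative 3D-GAN-style recipe, not the exact 3D-VAE-GAN procedure.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def adversarial_step(G, D, opt_g, opt_d, images, real_voxels):
    """One adversarial update for an image-to-voxel generator (sketch)."""
    fake = G(images)
    # Discriminator: push real voxel grids toward 1, generated grids toward 0.
    d_loss = bce(D(real_voxels), torch.ones(real_voxels.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(images.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to make the discriminator score fakes as real.
    g_loss = bce(D(fake), torch.ones(images.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```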

• Practical applications and use cases
The reconstruction can be performed with volumetric or surface representations, and powerful computer systems are needed to run it. Given below are some of the places where 3D object reconstruction deep learning systems are used:
• 3D reconstruction technology can be used by police departments to reconstruct the faces of criminals from crime-scene images in which their faces are not completely visible.
• It can be used to re-model ruins at ancient architectural sites: the rubble or debris stubs of structures can be used to recreate the entire building and give an idea of how it looked in the past.
• It can be used in plastic surgery, where organs, the face, limbs, or any other part of the body has been damaged and needs to be rebuilt.
• It can be used in airport security, where concealed shapes can help determine whether a person is armed or carrying explosives.
• It can also help in completing DNA sequences.
3.6 3D Models
Solid Model
Solid models deliver a 3D digital representation of an object with fully correct geometry. The geometry is correct in the other model types as well, but "solid" means the model is defined as a whole volume rather than only a surface: the object cannot be hollow. Much like the other types, solid models are built from three-dimensional shapes.
You can use a myriad of basic and complex shapes. These shapes act like building blocks that combine into a single object; you can add material to the blocks or subtract material from them (a minimal constructive sketch follows below). Some CAD programs instead use modifiers, starting with one large chunk of solid material that is methodically carved away, as if you were physically milling the base material in a workshop.
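The following is a minimal sketch of this add-and-subtract construction using signed distance functions in NumPy: boolean union adds material, subtraction carves it away, and thresholding the result on a grid yields a solid voxel model. The shapes and grid resolution are arbitrary illustrative choices.

```python
import numpy as np

# Signed distance functions: SDF < 0 means "inside the solid".
def sphere(p, center, r):
    return np.linalg.norm(p - center, axis=-1) - r

def box(p, center, half):
    q = np.abs(p - center) - half
    return np.linalg.norm(np.maximum(q, 0), axis=-1) + np.minimum(q.max(axis=-1), 0)

def union(d1, d2):    return np.minimum(d1, d2)   # add material
def subtract(d1, d2): return np.maximum(d1, -d2)  # carve material away

# Evaluate on a voxel grid: a block with a spherical hollow carved out.
g = np.linspace(-1, 1, 64)
p = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1)
solid = subtract(box(p, np.array([0, 0, 0]), np.array([0.6, 0.6, 0.6])),
                 sphere(p, np.array([0, 0, 0]), 0.4))
occupancy = solid < 0  # boolean 64^3 solid voxel model
```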
Wireframe Modeling
In cases where the object features many complex curves, wireframe modeling is often the method of choice. The basic shapes that serve as building blocks in solid modeling are sometimes too difficult to modify into the desired configuration and dimensions. Wireframe modeling allows for a smoother transition between curved edges in intricate objects. As complexity increases, however, some drawbacks become more apparent.
Surface Modeling
The surface model is a step up in detail. When seamless integration among edges and a smooth transition from one vertex to the next are required, higher computational power is needed to run the software for building a surface model. Compared to the previous two, surface modeling is more demanding, but only because it has the capability to create just about every shape that would be too difficult to attain with the solid or wireframe methods.
3.7 2D Models
2D reconstruction uses two kinds of 2D models: geometric 2D models and architectural 2D models.
Summary, Conclusion and Recommendation
4.1 Summary
The study was carried out to examine 2D/3D object reconstruction. 3D object reconstruction is the process of capturing the shape and appearance of real objects. 2D reconstruction is used, for example, to recreate a face from a skull with the help of soft-tissue depth estimates.
4.2 Conclusion
In conclusion, 2D/3D object reconstruction allows us to gain insight into qualitative features of an object that cannot be deduced from a single plane of sight, such as its volume and its position relative to other objects in the scene.
Most traditional face reconstructions require a special setup, expensive hardware, predefined conditions, and/or manual labor, which makes them impractical for general applications. Although recent approaches have overcome some of these setbacks, quality and speed are still not up to the expected levels. More realistic 3D character modeling software could be used to reconstruct the final 3D face, or the default 3D model could be created with such software.
4.3 Recommendation
i. Knowledge of 2D/3D object reconstruction is important and should therefore be taught in schools at higher levels.
ii. It should be made compulsory at all levels of higher institutions.
References
Häne, C.; Tulsiani, S.; Malik, J. Hierarchical surface prediction for 3D object reconstruction. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 412–420.
Kar, A.; Tulsiani, S.; Carreira, J.; Malik, J. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1966–1974.
Lu, Y.; Wang, Y.; Lu, G. Single image shape-from-silhouettes. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3604–3613.
Toshev, A.; Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
Zhang, C.; Pujades, S.; Black, M.; Pons-Moll, G. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4191–4200.
https://github.com/natowi/3D-Reconstruction-with-Deep-Learning-Methods
https://iaeme.com/MasterAdmin/Journal_uploads/IJCIET/VOLUME_8_ISSUE_12/IJCIET_
https://tongtianta.site/paper/68922