Deep Learning for Camera Calibration: A Review
Shrey Dixit
Abstract— Camera calibration is the first step in many computer vision applications such as Augmented Reality. However, traditional methods require multiple images of a known pattern (such as a checkerboard) and can take up to 30 minutes to calibrate a single camera. This paper reviews the current literature on deep learning approaches to camera calibration, which overcome the drawbacks of traditional methods by using only a single image of an arbitrary scene. I compare the different approaches and highlight their similarities and differences. Finally, I discuss the shortcomings of these approaches and propose future steps toward improving their performance.
I. Introduction
The last couple of years have seen a boom in computer vision applications. Augmented Reality is being used to teach human anatomy to medical students [ar-, 2019]. New industries are using image processing for metrology, analyzing microscopic images at the nano- and microscale. From 3D reconstruction to Augmented Reality games, calibrating the camera is the first step in all of these applications.
Camera calibration in the context of image processing is the determination of the intrinsic and/or extrinsic camera parameters. Intrinsic calibration determines the optical properties of the camera and lens, including the focal length, the principal point, and the distortion coefficients. Which parameters are estimated depends strongly on the camera model used for calibration; the two approaches reviewed here use different camera models and hence estimate different intrinsic parameters. Extrinsic parameters, on the other hand, describe the three-dimensional position and orientation of the camera's coordinate frame with respect to a reference frame.
However, traditional calibration techniques are extremely tedious and time consuming. They require multiple images of a known pattern of predefined geometry, typically a checkerboard. A person has to take dozens of images of that pattern in front of the camera from different angles to calibrate it correctly. This process can sometimes take up to 30 minutes and always requires someone with domain knowledge to calibrate the camera correctly. Accurate camera calibration is essential for many businesses, which spend considerable money and human resources on manual calibration.
The inefficiency and shortcomings of these approaches call for fully automated, single-image solutions for camera calibration. In recent years, deep learning, especially deep convolutional neural networks, has been used to automate many traditional image processing pipelines. Along these lines, [Bogdan et al., 2018] and [Hold-Geoffroy et al., 2018] have developed deep learning solutions for fully automated single-image camera calibration. Both approaches feed a single image of a general scene into convolutional neural networks and output the calibration parameters. The data used to train these models was generated from high-resolution, large-scale panorama datasets.
II. Approach 1
The first approach, "DeepCalib: A Deep Learning Approach for Automatic Intrinsic Calibration of Wide Field-of-View Cameras" by [Bogdan et al., 2018], is a fully automatic deep learning-based method for intrinsic calibration only. It builds upon recent developments in deep convolutional neural networks.
A. Camera Model
This approach uses the Unified Spherical Model [Mei and Rives, 2007] for the camera. The authors chose this model over the pinhole camera model because of its special properties. It is a fully reversible model, meaning that both the projection and the back-projection admit closed-form solutions. It involves a single distortion parameter ξ, ranging from 0 (undistorted) to 1 (highly distorted), that can represent a very large range of distortions. Compared to other models, it can also represent a much wider variety of cameras.
In the unified spherical model (Fig. 1), a world point Pw = (X, Y, Z), represented in three dimensions, is first projected onto the unit sphere at Ps = (Xs, Ys, Zs), also a three-dimensional point. The point Ps is then projected onto the image plane at p = (x, y). With zero distortion, the projection of the point p would originate at the sphere center O = (0, 0, 0). To model distortion, the projection instead originates at the point Oc = (0, 0, ξ). The entire process can be expressed as:
p = (x, y) = \left( \frac{fX}{\xi\sqrt{X^2 + Y^2 + Z^2} + Z} + u_0, \; \frac{fY}{\xi\sqrt{X^2 + Y^2 + Z^2} + Z} + v_0 \right)    (1)
Here, (u0, v0) is the principal point of the image and f is the focal length.

Fig. 1. The Unified Spherical Model by [Mei and Rives, 2007]: a fully reversible model with closed-form projection and back-projection equations. The distortion is modeled by a single distortion parameter ξ, which ranges between 0 and 1.

As discussed above, the unified spherical model also has a closed-form solution for the inverse (back-)projection, which can be expressed as:

P_s = (\omega\hat{x}, \; \omega\hat{y}, \; \omega - \xi), \qquad \omega = \frac{\xi + \sqrt{1 + (1 - \xi^2)(\hat{x}^2 + \hat{y}^2)}}{\hat{x}^2 + \hat{y}^2 + 1}    (2)

and

[\hat{x}, \hat{y}, 1]^T \simeq K^{-1} p, \qquad K = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}    (3)

Here K is the intrinsic calibration matrix.
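To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch of the forward projection and back-projection under the unified spherical model. It is not the authors' implementation; the function names and the example values are my own.

```python
import numpy as np

def project_unified(P_w, f, xi, u0, v0):
    """Project a 3D world point onto the image plane (Eq. 1)."""
    X, Y, Z = P_w
    d = np.sqrt(X**2 + Y**2 + Z**2)        # distance of the point from the sphere centre
    denom = xi * d + Z                     # projection centre shifted to (0, 0, xi)
    return np.array([f * X / denom + u0, f * Y / denom + v0])

def backproject_unified(p, f, xi, u0, v0):
    """Back-project an image point onto the unit sphere (Eqs. 2 and 3)."""
    x_hat = (p[0] - u0) / f                # normalised coordinates via K^{-1}
    y_hat = (p[1] - v0) / f
    r2 = x_hat**2 + y_hat**2
    omega = (xi + np.sqrt(1.0 + (1.0 - xi**2) * r2)) / (r2 + 1.0)
    return np.array([omega * x_hat, omega * y_hat, omega - xi])

# Round trip with example values: the recovered sphere point equals the
# world point normalised to unit length.
P_w = np.array([1.0, 0.5, 3.0])
p = project_unified(P_w, f=500.0, xi=0.4, u0=320.0, v0=240.0)
P_s = backproject_unified(p, f=500.0, xi=0.4, u0=320.0, v0=240.0)
```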
B. Data Generation

Deep learning approaches are data hungry, and supervised computer vision settings need a considerable number of image-label pairs. Unfortunately, no suitable dataset existed that the authors could have used. Additionally, manually labeling data in this context is infeasible, considering how much time and effort it takes to calibrate a camera with the traditional approaches. Therefore, the authors synthetically generated a large-scale dataset for camera calibration.

Figure 2 illustrates the whole data generation process, which uses a high-resolution panorama dataset and can be divided into two parts:
1) Project the panoramic image onto a sphere. For this, the (x, y) location of a pixel is transformed into the spherical coordinate system (θ, ϕ). This can be done linearly by converting the range of x from (0, H) to (0, π) and y from (0, W) to (−π/2, π/2).
2) Using Eq. 1, project the desired part of the sphere back onto the image plane with the chosen values of the intrinsic parameters ξ and f.

This procedure enabled the authors to generate millions of image-label pairs with varying scenes, distortions, focal lengths, camera types, etc. The ranges of the parameters used can be seen in Figure 2.

Fig. 2. Given an input panorama, images with different focal lengths f and distortion values ξ are generated automatically via the unified spherical model. [Bogdan et al., 2018]
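The sketch below illustrates this synthesis step, assuming an equirectangular panorama and the unified spherical model above; the angle convention used for the panorama lookup and the function name render_crop are my own assumptions, not the authors' code.

```python
import numpy as np

def render_crop(panorama, f, xi, width=299, height=299):
    """Synthesize one training image with known (f, xi) from an
    equirectangular panorama of shape (H, W, 3): back-project the target
    pixels onto the unit sphere (Eqs. 2-3), then look the rays up in the
    panorama."""
    H, W = panorama.shape[:2]
    u0, v0 = width / 2.0, height / 2.0

    # 1) Back-project every target pixel onto the unit sphere.
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    x_hat, y_hat = (xs - u0) / f, (ys - v0) / f
    r2 = x_hat**2 + y_hat**2
    omega = (xi + np.sqrt(1.0 + (1.0 - xi**2) * r2)) / (r2 + 1.0)
    X, Y, Z = omega * x_hat, omega * y_hat, omega - xi

    # 2) Convert the sphere points to panorama angles and sample the
    #    nearest pixels (one possible equirectangular convention).
    lat = np.arcsin(np.clip(Y, -1.0, 1.0))       # elevation in (-pi/2, pi/2)
    lon = np.arctan2(X, Z)                       # azimuth in (-pi, pi)
    rows = np.clip((lat + np.pi / 2) / np.pi * (H - 1), 0, H - 1).astype(int)
    cols = np.clip((lon + np.pi) / (2 * np.pi) * (W - 1), 0, W - 1).astype(int)
    return panorama[rows, cols]

# Sampling random (f, xi) labels and rendering crops from many panoramas
# yields the millions of image-label pairs used for training.
```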
C. Network Architectures
Figure 3 shows the three network architectures the authors tried. For each architecture, the authors also tried solving both a regression and a classification problem. For the classification problem, they used softmax as the activation function of the output layer and cross-entropy as the loss function. For the regression problem, they used a sigmoid activation in the output layer and log-cosh as the loss function. The backbone of all these architectures is the then state-of-the-art convolutional neural network architecture Inception V3 [Szegedy et al., 2015]. The three architectures are as follows (a code sketch follows the list):
1) SingleNet: a single Inception V3 model estimates both f and ξ.
2) DualNet: two independent Inception V3 models estimate f and ξ separately.
3) SeqNet: a first Inception V3 model estimates f; its output is concatenated with the original image and passed to a second Inception V3 model that estimates ξ.
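As an illustration of the SingleNet variant in its classification form, here is a minimal tf.keras sketch. It is my own reconstruction, not the authors' code, and the number of output bins per parameter is an assumption chosen only for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

NUM_BINS = 256  # number of discretization bins per parameter (assumed here)

def build_singlenet(input_shape=(299, 299, 3)):
    """SingleNet-style sketch: one Inception V3 backbone with two
    classification heads, one for the focal length f and one for the
    distortion parameter xi."""
    backbone = InceptionV3(include_top=False, pooling="avg",
                           input_shape=input_shape, weights="imagenet")
    features = backbone.output
    f_head = layers.Dense(NUM_BINS, activation="softmax", name="focal")(features)
    xi_head = layers.Dense(NUM_BINS, activation="softmax", name="xi")(features)
    return Model(inputs=backbone.input, outputs=[f_head, xi_head])

model = build_singlenet()
model.compile(optimizer="adam",
              loss={"focal": "sparse_categorical_crossentropy",
                    "xi": "sparse_categorical_crossentropy"})
# Training would use the synthetic (image, f-bin, xi-bin) pairs described above.
```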
D. Results
1) Quantitative Evaluation: Figure 4 shows the performance of all six models for estimating f and ξ. The performance was measured on a subset of the dataset generated from the panoramic images. As can be observed from the figure, the classification formulation works better for all three architectures, and SingleNet performs slightly better than the rest. Additionally, since SingleNet uses only one Inception V3 model, it is twice as fast as the other two. It is therefore the model of choice, and all results discussed for this approach in the remainder of this paper refer to this architecture.
2) User Study: It has been observed that humans ignore a certain amount of distortion in images. It is therefore difficult to judge the quality of a prediction by simply comparing the estimated focal length and distortion parameter to the ground-truth values. Hence, the authors conducted a user study to measure the level of accuracy required in camera calibration.
The participants were asked to rate the perceived amount of distortion for a set of images on a 5-point Likert scale, with 1 meaning no distortion and 5 meaning very high distortion. The authors noticed a significant scene-dependent bias in the results of this study. The amount of perceived distortion was much higher for images of urban scenes than for images of natural scenes. This could be attributed to the fact that urban scenes contain a lot of
straight lines which, when distorted, are easy to detect
for humans.
Therefore, the authors had to repeat the study. They first manually assigned every image used in the study to one of three categories based on the straight lines present in the scene: 1) Urban (many lines), 2) Semi-Urban (few lines), and 3) Nature (no lines); see Figure 5.
The results of the study are shown in Figure 6. As one can observe from the figure, the perceived distortion is significantly higher for urban images than for natural images. It can also be seen that the predicted distortion does not suffer from the same bias as the human observers, which speaks for the model. The key observation is that for a ground-truth distortion below 0.2, participants did not even consider the images distorted. Since the error in the distortion parameter was below 0.2 for 78% of the images, the authors therefore concluded that the accuracy of the model can be stated as 78%.
Another, rather unexplained, observation from the figure is that the x-axis is not linear: it starts at 0, then jumps to 0.025, but later also jumps from 0.4 to 0.7. The x-axis is closer to logarithmic than linear, and the authors provide no explanation for this.
3) State-of-the-art Comparison: Figure 7 shows the comparison of DeepCalib with the state-of-the-art approaches. We can observe that DeepCalib performs considerably worse than most approaches, but it has advantages that cannot be overlooked. First, DeepCalib works for all the cameras shown in the comparison figure, while none of the other methods do. Another advantage is that it only needs a single image, and of any scene, whereas the others require multiple images of a known geometry. Finally, the calibration time of DeepCalib is extremely low compared to the other approaches.
Fig. 3. Illustration of the three network architectures: SingleNet, DualNet, and SeqNet. [Bogdan et al., 2018]

Fig. 4. Cumulative error distribution of the estimated distortion (left) and focal length (right) with respect to the ground truth. [Bogdan et al., 2018]

Fig. 5. Types of images used in the user study. [Bogdan et al., 2018]
E. Applications
The original paper discusses two main applications: 3D reconstruction and image undistortion in the wild. Here, I will only discuss image undistortion in the wild. The process can be divided into three main steps (a small code sketch follows the list):
1) Given a distorted image, estimate the focal length and the distortion parameter.
2) Project the image onto the sphere using Eqs. 2 and 3 with the estimated focal length and distortion parameter.
3) Set the distortion parameter to 0 and project the sphere back onto the image plane using Eq. 1.
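The following is a compact sketch of this pipeline under the same unified spherical model; the function name undistort and the nearest-neighbour resampling are my own simplifications, not the authors' implementation.

```python
import numpy as np

def undistort(image, f, xi):
    """Undistort an image given the estimated (f, xi): for every pixel of
    the xi = 0 output image, find the corresponding pixel in the distorted
    input via sphere back-projection (Eqs. 2-3) and re-projection (Eq. 1)."""
    H, W = image.shape[:2]
    u0, v0 = W / 2.0, H / 2.0

    # Back-project the output grid onto the unit sphere with xi = 0.
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    x_hat, y_hat = (xs - u0) / f, (ys - v0) / f
    norm = np.sqrt(x_hat**2 + y_hat**2 + 1.0)
    X, Y, Z = x_hat / norm, y_hat / norm, 1.0 / norm

    # Re-project the sphere points with the estimated distortion xi (Eq. 1).
    denom = xi + Z                           # points already lie on the unit sphere
    src_x = f * X / denom + u0
    src_y = f * Y / denom + v0

    # Nearest-neighbour lookup in the distorted source image.
    src_x = np.clip(np.round(src_x), 0, W - 1).astype(int)
    src_y = np.clip(np.round(src_y), 0, H - 1).astype(int)
    return image[src_y, src_x]
```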
Fig. 6. Results of the user study (left) vs. the algorithm's results (right). [Bogdan et al., 2018]
Fig. 7. State-of-the-art comparison. [Bogdan et al., 2018]
The results are demonstrated in Figure 8. Visually, the algorithm works very well and is able to undistort a wide variety of images.

Fig. 8. Examples of automatic undistortion results on images in the wild. Left: original image. Right: output of the algorithm. [Bogdan et al., 2018]
III. Approach 2
The second approach, "A Perceptual Measure for Deep Single Image Camera Calibration" by [Hold-Geoffroy et al., 2018], is very similar to the first one but can be used for both intrinsic and extrinsic calibration. It is also used in a wide variety of applications, including virtual object insertion, image retrieval, and compositing.
A. Camera Model
Unlike the first approach, this one uses a geometric camera model. Since the authors calibrate both intrinsic and extrinsic parameters, there are three parameters to be estimated:
1) Vertical field of view: hθ = 2 arctan(h/2fpx), where h is the image height and fpx is the focal length in pixels.
2) Horizon line midpoint: bp = 2fpx tan θ, where θ is the pitch angle.
3) Yaw angle: ψ.
Figure 9 illustrates these angles and helps to understand the camera model used in this approach (a small sketch of the parameter conversion follows).

Fig. 9. Definition of the pitch, roll, and yaw angles. [Zhang et al., 2014]
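Inverting the two relations above gives the focal length and pitch angle from estimates of hθ and bp; the following lines are only a sketch of that conversion under the formulas as stated here, not code from the paper.

```python
import math

def fov_and_horizon_to_f_and_pitch(h_theta, b_p, image_height):
    """Recover the focal length (in pixels) and pitch angle from the vertical
    field of view h_theta and horizon midpoint b_p, by inverting
    h_theta = 2*arctan(h / (2*f_px)) and b_p = 2*f_px*tan(theta)."""
    f_px = (image_height / 2.0) / math.tan(h_theta / 2.0)
    pitch = math.atan(b_p / (2.0 * f_px))
    return f_px, pitch

# Example with assumed values: a 60-degree vertical field of view on a
# 1080-pixel-high image and a horizon midpoint of 200 pixels.
f_px, pitch = fov_and_horizon_to_f_and_pitch(math.radians(60), 200.0, 1080)
```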
B. Data Generation

Parameter              Distribution   Values
Focal length (mm)      Lognormal      s = 0.8, loc = 14, scale = 17
Horizon (im. height)   Normal         µ = 0.046, σ = 0.6
Roll (rad)             Cauchy         x0 = 0, γ ∈ [0.001, 0.1]
Aspect ratio           Varying        1:1, 5:4, 4:3, 3:2, 16:9
The data was generated in the same manner as in [Bogdan et al., 2018], but using the geometric camera model. Seven images were created from each panoramic image, with the parameters sampled from the distributions listed in the table above.
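The table maps directly onto SciPy's distribution parameterization; the snippet below is a plausible sketch of such sampling (the choice of scipy.stats and the uniform draw of γ for the Cauchy scale are my assumptions).

```python
import random
from scipy.stats import lognorm, norm, cauchy

def sample_camera_parameters():
    """Draw one set of synthetic camera parameters following the
    distributions in the table above (parameter names as in scipy.stats)."""
    focal_mm = lognorm.rvs(s=0.8, loc=14, scale=17)
    horizon = norm.rvs(loc=0.046, scale=0.6)          # in units of image height
    gamma = random.uniform(0.001, 0.1)                # assumed draw for the Cauchy scale
    roll = cauchy.rvs(loc=0.0, scale=gamma)           # in radians
    aspect = random.choice([(1, 1), (5, 4), (4, 3), (3, 2), (16, 9)])
    return focal_mm, horizon, roll, aspect
```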
C. Network and Architecture
The authors used a DenseNet model pretrained on the ImageNet dataset. The last layer of the model was replaced by three separate heads for ψ, hθ, and bp. They framed the task as a classification problem, so each head has 256 output neurons with a softmax activation function (a minimal sketch follows).
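A minimal tf.keras sketch of such a three-head DenseNet classifier is shown below; the specific DenseNet variant (DenseNet121) and the input size are my assumptions, and only the three 256-way softmax heads follow the description above.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

def build_calibration_net(input_shape=(224, 224, 3), bins=256):
    """DenseNet backbone with three 256-way classification heads for the
    yaw angle, vertical field of view, and horizon midpoint."""
    backbone = DenseNet121(include_top=False, pooling="avg",
                           input_shape=input_shape, weights="imagenet")
    features = backbone.output
    yaw = layers.Dense(bins, activation="softmax", name="yaw")(features)
    fov = layers.Dense(bins, activation="softmax", name="fov")(features)
    horizon = layers.Dense(bins, activation="softmax", name="horizon")(features)
    return Model(inputs=backbone.input, outputs=[yaw, fov, horizon])
```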
D. Results

Fig. 10. Results for [Hold-Geoffroy et al., 2018]. Pitch (left) and roll (right) estimation performance on the HLW dataset (top), and vertical field-of-view estimation performance on the SUN360 test set, shown as a box-percentile plot (left) and a cumulative distribution function (right).
Figure 10 shows the results for this approach. The errors are lowest for median parameter values and become higher as the parameter values move further away from their median. This can also be attributed to the fact that the authors used output bins with smaller intervals for values closer to the median. In terms of accuracy, this model performs better than previously proposed single-image calibration algorithms that did not use deep learning. However, comparing this approach to the previous one in terms of accuracy is not possible, because the two use different evaluation metrics and different camera models.
E. Applications
The original paper discusses three main applications of the approach: image retrieval, geometrically consistent object transfer across images, and virtual 3D object insertion. These are general applications that are also valid for the previous approach.
1) Image Retrieval: With this approach, images can be retrieved based on their camera parameters. The authors created a large database of images, ran their model on it to obtain the camera parameters, and computed the intersection of the horizon line with the left and right image boundaries. The same parameters are computed for the query image, and the database images are sorted by their L2 distance from the query parameters (a small sketch follows). Figure 11 shows the results of image retrieval using this approach.
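A sketch of this ranking step is given below; representing each image by the two horizon intersections (left and right image boundary) and sorting by Euclidean distance follows the description above, while the data layout and function names are mine.

```python
import numpy as np

def rank_by_horizon(query_horizon, database_horizons):
    """Sort database images by the L2 distance between their horizon-line
    descriptors and the query's descriptor.

    Each descriptor is a 2-vector: the heights at which the estimated
    horizon line crosses the left and right image boundaries."""
    query = np.asarray(query_horizon, dtype=float)
    db = np.asarray(database_horizons, dtype=float)       # shape (N, 2)
    distances = np.linalg.norm(db - query, axis=1)
    return np.argsort(distances)                           # indices, best match first

# Example with made-up descriptors (normalized image coordinates)
ranking = rank_by_horizon([0.45, 0.55], [[0.9, 0.1], [0.5, 0.5], [0.44, 0.58]])
# ranking[0] is 2: the image whose horizon is closest to the query's
```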
Fig. 11. Examples of image retrieval. The horizon line is estimated from the query image and used to find the closest matches in a random 10k subset. The top-4 matches are shown on the right. [Hold-Geoffroy et al., 2018]
2) Geometrically-consistent object transfer: Moving an object from one picture to another requires aligning the camera parameters of both images. Unlike earlier methods, which needed objects of known height to be present in the image in order to determine the camera parameters, these deep learning approaches infer the parameters from the images themselves. This makes it possible to transfer objects between images realistically. An illustration can be seen in Figure 12.

Fig. 12. The water tower is placed onto a picture with a similar detected horizon line. Observe how the perspective appears accurate without any alterations. [Hold-Geoffroy et al., 2018]

3) Virtual object insertion: Similar to the previous application, aligning the camera parameters is crucial for inserting a 3D object into a 2D image. With automatic deep learning-based camera parameter estimation, the user only has to choose an insertion point and specify the virtual camera height. If the area surrounding the object is a flat plane aligned with the horizon, the virtual object can be inserted automatically. An illustration can be seen in Figure 13.

Fig. 13. Examples of virtual object insertion using the estimated camera calibration. [Hold-Geoffroy et al., 2018]
IV. Discussion and Future Work
The first approach covers only intrinsic calibration, while the second covers both intrinsic and extrinsic calibration. The first approach compares itself against other state-of-the-art techniques such as checkerboard calibration, while the second one does not. The two approaches also differ in the camera model they use.
However, there are obvious future extensions of these approaches that should be tried and evaluated. Since the camera to be calibrated can capture multiple photos, the methods do not necessarily have to rely on a single image. The approaches could be evaluated on how they perform with multiple images from the same camera, and how this compares to calibration from a single image. Sequence models such as recurrent neural networks or Transformers could also be used to pass in a stream of frames and predict the camera parameters, as opposed to simply averaging the parameters estimated from multiple images.
Additionally, since we observed that the first approach
works much better for urban scenes because of their many straight lines, one could also take advantage of this. In the case of AR with a projector, the projector could project some lines or a checkerboard into the scene, which is then fed into the models to see how much the prediction improves with the extra projected lines.
V. Conclusion
To conclude this review, one can confidently say that the deep learning approaches are not yet as accurate as the traditional approaches. The authors themselves point out that the objective of these methods is not to beat the traditional approaches in terms of accuracy. Instead, they are much faster and only require a single image of a general scene, whereas the traditional approaches need a calibration target of known geometry.
References
[ar-, 2019] (2019). CAE VimedixAR transforms medical education with Microsoft HoloLens. https://customers.microsoft.com/en-us/story/718933-cae-healthcare-hololens-en.
[Bogdan et al., 2018] Bogdan, O., Eckstein, V., Rameau, F., and Bazin, J.-C. (2018). DeepCalib: A deep learning approach for automatic intrinsic calibration of wide field-of-view cameras. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–10.
[Brunken and Gühmann, 2020] Brunken, H. and Gühmann, C. (2020). Deep learning self-calibration from planes. In Twelfth International Conference on Machine Vision (ICMV 2019), volume 11433, pages 980–990. SPIE.
[Eser, 2020] Eser, A. Y. (2020). OpenCV camera calibration. https://aliyasineser.medium.com/opencv-camera-calibration-e9a48bdd1844.
[Hold-Geoffroy et al., 2018] Hold-Geoffroy, Y., Sunkavalli, K., Eisenmann, J., Fisher, M., Gambaretto, E., Hadap, S., and Lalonde, J.-F. (2018). A perceptual measure for deep single image camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2354–2363.
[Mei and Rives, 2007] Mei, C. and Rives, P. (2007). Single view point omnidirectional camera calibration from planar grids. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 3945–3950. IEEE.
[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.
[Zhang et al., 2014] Zhang, L., Fan, Q., Li, Y., Uchimura, Y., and Serikawa, S. (2014). An implementation of document image reconstruction system on a smart device using a 1D histogram calibration algorithm. Mathematical Problems in Engineering, 2014.
Declaration of an Oath
Hereby I, Shrey Dixit, declare that I have authored this thesis, titled Deep Learning
for Camera Calibration: A Review, and the presentation slides for the associated oral
presentation independently and unaided. Furthermore, I confirm that I have not used
other than the declared sources / resources.
I have explicitly marked all material which has been quoted either literally or by content
from the used sources.
This thesis, in same or similar form, has not been published, presented to an examination
board or submitted as an exam or course achievement.
Hamburg, February 5, 2023
Shrey Dixit