Deep Learning for Camera Calibration: A Review

Shrey Dixit
Intelligent Robotics Seminar 2018

Abstract— Camera calibration is the first step in many computer vision applications such as Augmented Reality. However, traditional methods require multiple images of a known pattern (such as a checkerboard) and can take up to 30 minutes to calibrate a single camera. This paper reviews the current literature on deep learning approaches for camera calibration, which overcome the drawbacks of the traditional approaches by using only a single image of an arbitrary scene. I compare different approaches and highlight the differences and similarities between them. Finally, I discuss the shortcomings of these approaches and propose future steps for improving the performance of deep learning based calibration.

I. Introduction

The last couple of years have seen a boom in computer vision applications. Augmented Reality is being used to teach human anatomy to medical students [ar-, 2019]. New industries are utilizing image processing for metrology to process and analyze microscopic images at the nano- and microscale. From 3D reconstruction to Augmented Reality games, camera calibration is the first step in all of these applications.

Camera calibration in the context of image processing is the determination of the intrinsic and/or extrinsic camera parameters. Intrinsic calibration determines the optical properties of the camera lens, including the focal length, principal point, and distortion coefficients. These parameters depend heavily on the camera model used for calibration; the two approaches reviewed here use different camera models and therefore estimate different intrinsic parameters. The extrinsic parameters, on the other hand, describe the three-dimensional position and orientation of the camera's coordinate frame with respect to a reference frame.

However, traditional calibration techniques are extremely tedious and time consuming. They need multiple images of a known pattern of predefined geometry, typically a checkerboard. A person has to take dozens of images of that pattern in front of the camera from different angles to calibrate the camera correctly. This process can sometimes take up to 30 minutes and always requires a person with domain knowledge. Accurate camera calibration is essential for many businesses, which spend considerable money and human resources on manual calibration.

The inefficiency and shortcomings of these approaches call for fully automated, single-image solutions for camera calibration. Deep learning, and especially deep convolutional neural networks, has been used to automate many traditional image processing pipelines in recent years. In this spirit, [Bogdan et al., 2018] and [Hold-Geoffroy et al., 2018] have developed deep learning solutions for fully automated single-image camera calibration. Both approaches feed a single image of a general scene to convolutional neural networks and output the calibration parameters. The data used to train these models was generated from high-resolution, large-scale panorama datasets.
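To make concrete how involved the traditional pipeline is, the following minimal OpenCV sketch shows the usual checkerboard-based procedure. The board size, image paths, and number of views are illustrative assumptions of mine, not values taken from the reviewed papers:

```python
# Minimal sketch of traditional checkerboard calibration with OpenCV.
# The 9x6 inner-corner board and the image glob are illustrative assumptions.
import glob
import cv2
import numpy as np

board_size = (9, 6)  # inner corners per row and column
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []  # 3D board points and matching 2D detections
gray = None
for path in glob.glob("calib_images/*.jpg"):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, board_size, None)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Needs dozens of successful detections from varied viewpoints to be reliable.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("intrinsic matrix:\n", K)
print("distortion coefficients:", dist.ravel())
```

Even this minimal version requires collecting dozens of well-spread views of a physical target, which is precisely the manual effort the deep learning approaches reviewed below aim to remove.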
II. Approach 1

The first approach, "DeepCalib: A Deep Learning Approach for Automatic Intrinsic Calibration of Wide Field-of-View Cameras" by [Bogdan et al., 2018], is a fully automatic deep learning based approach for intrinsic calibration only. It builds upon recent developments in deep convolutional neural networks.

A. Camera Model

This approach uses the unified spherical model [Mei and Rives, 2007] for the camera. The authors chose this model instead of the pinhole camera model because of its special properties. It is a fully reversible model, meaning that both the projection and the back-projection admit closed-form solutions. It involves a single distortion parameter ξ, ranging from 0 (undistorted) to 1 (highly distorted), that can model a very large range of distortions, and it can represent a wider variety of cameras than other models.

In the unified spherical model (Fig. 1), a world point Pw = (X, Y, Z), represented in three dimensions, is first projected onto the unit sphere at Ps = (Xs, Ys, Zs), also a three-dimensional point. This point Ps is then projected onto the image plane at p = (x, y). If there were zero distortion, this projection would originate at the sphere center O = (0, 0, 0). To model distortion, the projection instead originates at the point Oc = (0, 0, ξ). The entire process can be expressed as:

p = (x, y) = \left( \frac{fX}{\xi\sqrt{X^2 + Y^2 + Z^2} + Z} + u_0, \; \frac{fY}{\xi\sqrt{X^2 + Y^2 + Z^2} + Z} + v_0 \right)    (1)

Here, (u0, v0) is the principal point of the image and f is the focal length.

Fig. 1. The unified spherical model of [Mei and Rives, 2007]: a fully reversible model with closed-form projection and back-projection equations. The distortion is modeled by a single distortion parameter ξ which can range between 0 and 1.

As discussed above, the unified spherical model also has a closed-form solution for the inverse projection, which can be expressed as:

P_s = (\omega \hat{x}, \, \omega \hat{y}, \, \omega - \xi), \quad \omega = \frac{\xi + \sqrt{1 + (1 - \xi^2)(\hat{x}^2 + \hat{y}^2)}}{\hat{x}^2 + \hat{y}^2 + 1}    (2)

and

[\hat{x}, \hat{y}, 1]^T \simeq K^{-1} p, \quad K = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}    (3)

Here K is the intrinsic calibration matrix.
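To make the camera model more tangible, the following is a small NumPy sketch of Eqs. 1–3 as I read them; it is my own illustrative implementation, not code from [Bogdan et al., 2018]:

```python
# Illustrative NumPy implementation of the unified spherical model (Eqs. 1-3).
import numpy as np

def project(P, f, xi, u0, v0):
    """Project a 3D world point P = (X, Y, Z) to pixel coordinates (Eq. 1)."""
    X, Y, Z = P
    d = xi * np.sqrt(X**2 + Y**2 + Z**2) + Z  # shifted projection centre Oc = (0, 0, xi)
    return np.array([f * X / d + u0, f * Y / d + v0])

def back_project(p, f, xi, u0, v0):
    """Back-project a pixel p = (x, y) to the point Ps on the unit sphere (Eqs. 2-3)."""
    x_hat = (p[0] - u0) / f  # [x_hat, y_hat, 1]^T ~ K^{-1} p
    y_hat = (p[1] - v0) / f
    r2 = x_hat**2 + y_hat**2
    omega = (xi + np.sqrt(1 + (1 - xi**2) * r2)) / (r2 + 1)
    return np.array([omega * x_hat, omega * y_hat, omega - xi])

# Round trip with arbitrary example values for P, f, xi and the principal point:
p = project(np.array([0.3, -0.2, 2.0]), f=500.0, xi=0.5, u0=320.0, v0=240.0)
Ps = back_project(p, f=500.0, xi=0.5, u0=320.0, v0=240.0)
```

Back-projecting a projected point recovers the original world point normalized onto the unit sphere, which is exactly the reversibility property the authors exploit.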
B. Data Generation

Deep learning approaches are data hungry, and supervised computer vision settings need a considerable number of image-label pairs. Unfortunately, no suitable dataset existed that the authors could have used, and manually labeling data in this context is infeasible considering how much time and effort it takes to calibrate a camera with the traditional approaches. Therefore, the authors synthetically generated a large-scale dataset for camera calibration from a high-resolution panorama dataset. Figure 2 illustrates the whole data generation process, which can be divided into two steps:

1) Project the panoramic image onto a sphere. For this, we need to transform the (x, y) location of a pixel into the spherical coordinates (θ, ϕ). This can be done linearly by converting the range of x from (0, H) to (0, π) and of y from (0, W) to (−π/2, π/2).

2) Using Eq. 1, project the relevant part of the sphere back onto the image plane with the desired values of the intrinsic parameters ξ and f.

Fig. 2. Given an input panorama, images with different focal lengths f and distortion values ξ are generated automatically via the unified spherical model. [Bogdan et al., 2018]

This approach enabled the authors to generate millions of image-label pairs with varying scenes, distortions, focal lengths, camera types, etc. The ranges of the parameters used can be seen in Figure 2.

C. Network Architectures

Figure 3 shows the three network architectures the authors tried. For each architecture, the authors also tried solving both a regression and a classification problem. For the classification problem, they used softmax as the activation function of the output layer and cross-entropy as the loss function. For the regression problem, they used a sigmoid activation in the output layer and the log-cosh loss. The backbone of all these architectures is based on the then state-of-the-art convolutional neural network architecture Inception V3 [Szegedy et al., 2015]. The three architectures are as follows (a minimal sketch of the first one follows below):

1) SingleNet: a single Inception V3 model estimates both f and ξ.
2) DualNet: two independent Inception V3 models estimate f and ξ separately.
3) SeqNet: a first Inception V3 model estimates f; its output is concatenated with the original image and passed to a second Inception V3 model that estimates ξ.

Fig. 3. Illustration of the three network architectures: SingleNet, DualNet and SeqNet. [Bogdan et al., 2018]
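As an illustration of the SingleNet variant in the classification setting, a minimal Keras sketch could look as follows. The bin counts, input resolution, pretraining, and optimizer are my own assumptions rather than the authors' exact choices:

```python
# Minimal sketch of a SingleNet-style model: one Inception V3 backbone with two
# classification heads, one for the focal length f and one for the distortion xi.
# Bin counts, input size, pretraining and optimizer are illustrative assumptions.
import tensorflow as tf

NUM_F_BINS, NUM_XI_BINS = 46, 41  # assumed discretisation of f and xi

backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
features = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)

# Two softmax heads turn calibration into a classification problem,
# trained with cross-entropy as described for the classification variant.
f_head = tf.keras.layers.Dense(NUM_F_BINS, activation="softmax", name="focal")(features)
xi_head = tf.keras.layers.Dense(NUM_XI_BINS, activation="softmax", name="distortion")(features)

model = tf.keras.Model(inputs=backbone.input, outputs=[f_head, xi_head])
model.compile(optimizer="adam",
              loss={"focal": "sparse_categorical_crossentropy",
                    "distortion": "sparse_categorical_crossentropy"})
```

The regression variant would instead end in sigmoid outputs trained with the log-cosh loss, and DualNet/SeqNet would use two such backbones.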
D. Results

1) Quantitative Evaluation: Figure 4 shows the performance of all six models for estimating both f and ξ. The performance was measured on a subset of the dataset generated from the panoramic images. As can be observed from the figure, the classification approach works better for all three architectures, and the SingleNet architecture performs slightly better than the rest. Additionally, since SingleNet uses only one Inception V3 model, it is twice as fast as the other two. It is therefore the model of choice, and all results discussed for this approach in the remainder of this paper refer to this architecture.

Fig. 4. Cumulative error distribution of the estimated distortion (left) and focal length (right) with respect to the ground truth. [Bogdan et al., 2018]

2) User Study: It has been observed that humans ignore a certain amount of distortion in images. It is therefore difficult to judge the quality of a prediction simply by comparing the estimated focal length and distortion parameter to the ground-truth values. Hence, the authors conducted a user study to measure the level of accuracy required in camera calibration. The participants were asked to rate the perceived amount of distortion for a set of images on a 5-point Likert scale, with 1 meaning no distortion and 5 meaning very strong distortion.

The authors noticed a significant scene-dependent bias in the results of this study: the amount of perceived distortion was much higher for images of urban scenes than for images of natural scenes. This can be attributed to the fact that urban scenes contain many straight lines which, when distorted, are easy for humans to detect. Therefore, the authors had to re-conduct the study. They first manually assigned all images used in the study to one of three categories, based on the straight lines present in the scene: 1) Urban (many lines), 2) Semi-Urban (few lines), and 3) Nature (no lines); see Figure 5.

Fig. 5. Types of images used in the user study. [Bogdan et al., 2018]

The results of the study are shown in Figure 6. As one can observe from the figure, there is a significantly higher amount of perceived distortion for urban images than for natural images. It can also be seen that the predicted distortion does not suffer from the same bias as humans do, which is desirable. The key observation is that for a ground-truth distortion of less than 0.2, the participants did not consider the images distorted at all. Since the error in the distortion parameter was below 0.2 for 78% of the images, the authors concluded that the accuracy of the model can be stated as 78%.

Another, rather unexplained, observation from the figure is that the x-axis is not linear: it starts at 0, then jumps to 0.025, but later also goes from 0.4 to 0.7. The x-axis is logarithmic rather than linear, and the authors provide no explanation for this.

Fig. 6. Results of the user study (left) vs. the algorithm's results (right). [Bogdan et al., 2018]

3) State-of-the-Art Comparison: Figure 7 shows the comparison of DeepCalib with state-of-the-art approaches. We can observe that DeepCalib performs considerably worse than most approaches, but it has certain advantages that cannot be overlooked. First, DeepCalib works for all the cameras shown in the comparison figure, whereas none of the other methods do. Second, it only needs a single image, and that image can be of any scene, whereas the other methods require multiple images of a known geometry. Finally, the calibration time of DeepCalib is extremely low compared to the other approaches.

Fig. 7. State-of-the-art comparison. [Bogdan et al., 2018]

E. Applications

The original paper discusses two main applications: 3D reconstruction and image undistortion in the wild. Here I will only discuss image undistortion in the wild. The process can be divided into three main steps (a rough sketch of steps 2 and 3 follows below):

1) Given a distorted image, estimate the focal length and the distortion parameter.
2) Project that image onto the sphere using Eqs. 2 and 3 and the estimated focal length and distortion parameter.
3) Set the distortion parameter to 0 and project the sphere back onto the image plane using Eq. 1.

The results of this are demonstrated in Figure 8. The algorithm works very well even on visual inspection and is able to undistort a wide variety of images.

Fig. 8. Examples of automatic undistortion results on images in the wild. Left: original image. Right: output of the algorithm. [Bogdan et al., 2018]
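To sketch how steps 2 and 3 can be implemented in practice, the following NumPy/OpenCV snippet performs the undistortion as an inverse mapping: for every pixel of the undistorted output it finds the corresponding source location in the distorted input via the sphere. This is my own sketch, assuming f and ξ have already been estimated; it is not the authors' implementation.

```python
# Rough sketch of "undistortion in the wild" (steps 2 and 3), assuming the focal
# length f and distortion xi were already estimated. Not the authors' implementation.
import cv2
import numpy as np

def undistort(img, f, xi):
    h, w = img.shape[:2]
    u0, v0 = w / 2.0, h / 2.0

    # Step 2: back-project every pixel of the desired undistorted (xi = 0) image
    # onto the unit sphere; with xi = 0 this is just the normalised pinhole ray.
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x_hat, y_hat = (xs - u0) / f, (ys - v0) / f
    norm = np.sqrt(x_hat**2 + y_hat**2 + 1.0)
    X, Y, Z = x_hat / norm, y_hat / norm, 1.0 / norm

    # Step 3: re-project the sphere points into the original distorted image with
    # the estimated xi (Eq. 1); since ||(X, Y, Z)|| = 1, the denominator is xi + Z.
    d = xi + Z
    map_x = (f * X / d + u0).astype(np.float32)
    map_y = (f * Y / d + v0).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```

Formulating it as an inverse sampling map avoids holes in the output and lets cv2.remap handle the interpolation.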
III. Approach 2

The second approach, "A Perceptual Measure for Deep Single Image Camera Calibration" by [Hold-Geoffroy et al., 2018], is very similar to the previous one but can be used for both intrinsic and extrinsic calibration. It also supports a wide variety of applications, including virtual object insertion, image retrieval, and compositing.

A. Camera Model

Unlike the first approach, this one uses the geometric camera model. Since the authors calibrate both intrinsic and extrinsic parameters, three parameters are estimated:

1) the vertical field of view hθ = 2 arctan(h / 2fpx), where h is the image height and fpx is the focal length in pixels;
2) the horizon line midpoint bp = 2 fpx tan θ, where θ is the pitch angle;
3) the roll angle ψ.

Figure 9 illustrates the different angles and helps in understanding the camera model used in this study.

Fig. 9. Definition of the pitch, roll, and yaw angles. [Zhang et al., 2014]

B. Data Generation

The data was generated in the same manner as in [Bogdan et al., 2018], but using the geometric camera model. Seven images were created from each panoramic image, with the parameters sampled from the distributions listed in the table below.

Parameter              Distribution   Values
Focal length (mm)      Lognormal      s = 0.8, loc = 14, scale = 17
Horizon (im. height)   Normal         µ = 0.046, σ = 0.6
Roll (rad)             Cauchy         x0 = 0, γ ∈ [0.001, 0.1]
Aspect ratio           Varying        1:1, 5:4, 4:3, 3:2, 16:9

C. Network and Architecture

The authors used a DenseNet model pretrained on the ImageNet dataset. The last layer of the model was replaced by three separate heads, one each for ψ, hθ, and bp. They posed the task as a classification problem, so each head has 256 output neurons and uses the softmax activation function.

D. Results

Figure 10 shows the results for this approach. The errors are lowest for the median parameter values and become higher as the parameter values move further away from their median. This could also be attributed to the fact that the authors used output neurons covering smaller intervals for values closer to the median. In terms of accuracy, this model performs better than previously proposed single-image calibration algorithms that do not use deep learning. However, comparing this approach to the previous one in terms of accuracy is not possible, because the two use different evaluation metrics and different camera models.

Fig. 10. Results for [Hold-Geoffroy et al., 2018]: pitch (left) and roll (right) estimation performance on the HLW dataset (top), and vertical field of view estimation performance on the SUN360 test set, displayed as a "box-percentile plot" (left) and a cumulative distribution function (right).

E. Applications

The original paper discusses three main applications of the approach: image retrieval, geometrically-consistent object transfer across images, and virtual 3D object insertion. These are general applications that would also be valid for the previous approach.

1) Image Retrieval: With this approach, images can be retrieved based on their camera parameters. The authors created a large database of images, ran their model on it to obtain the camera parameters, and computed the intersections of the horizon line with the left and right image boundaries. The same parameters are computed for the query image, and the database images are sorted by their L2 distance from the query parameters (a small sketch of this ranking follows at the end of this section). Figure 11 shows image retrieval results obtained with this approach.

Fig. 11. Examples of image retrieval. The horizon line is estimated from the query image and used to find the closest matches in a 10k random subset. The top-4 matches are shown on the right. [Hold-Geoffroy et al., 2018]

2) Geometrically-Consistent Object Transfer: Moving objects from one picture to another requires aligning the camera parameters. Unlike earlier methods, which needed an object of known height to be present in the image to determine the camera parameters, these deep learning approaches infer them from the images themselves. This makes it possible to transfer objects between images realistically. An illustration can be seen in Figure 12.

Fig. 12. The water tower is placed onto a picture with a similar detected horizon line. Observe how the perspective appears accurate without any alterations. [Hold-Geoffroy et al., 2018]

3) Virtual Object Insertion: Similar to the previous application, aligning the camera parameters is crucial for inserting a 3D object into a 2D image. With automatic deep learning based camera parameter estimation, the user only has to choose an insertion point and specify the virtual camera height. If the area surrounding the object is a flat plane aligned with the horizon, the virtual object can be inserted automatically. An illustration can be seen in Figure 13.

Fig. 13. Examples of virtual object insertion using the estimated camera calibration. [Hold-Geoffroy et al., 2018]
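The retrieval ranking itself is simple enough to sketch. The following is my own guess at such a descriptor, assuming the horizon line is parameterized by its vertical midpoint and a roll angle; the exact parameterization used by the authors may differ:

```python
# Illustrative sketch of horizon-based retrieval: describe each image by where its
# estimated horizon crosses the left and right borders, then rank by L2 distance.
# The horizon parameterisation (midpoint + roll) is an assumption of this sketch.
import numpy as np

def horizon_descriptor(midpoint_y, roll, width):
    """Left/right border intersections of a horizon through (width / 2, midpoint_y)."""
    slope = np.tan(roll)
    left = midpoint_y - slope * (width / 2.0)
    right = midpoint_y + slope * (width / 2.0)
    return np.array([left, right])

def retrieve(query_desc, database_descs, k=4):
    """Return indices of the k database images with the closest horizon lines."""
    dists = np.linalg.norm(database_descs - query_desc, axis=1)
    return np.argsort(dists)[:k]
```

With k = 4 this mirrors the top-4 matches shown in Figure 11.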
IV. Discussion and Future Work

The first approach covers only intrinsic calibration, while the second covers both intrinsic and extrinsic calibration. The first approach reports a comparison with other state-of-the-art techniques such as checkerboard calibration, while the second one does not. The two approaches also differ in the camera model they use.

There are, however, obvious future extensions of these approaches that should be tried and evaluated. Since the camera that needs to be calibrated can usually capture multiple photos, calibration does not necessarily have to rely on a single image. The approaches could be evaluated on how they perform with multiple images from the same camera and how this compares to calibration from only a single image. Sequence models such as recurrent neural networks or Transformers could also be used to pass in a stream of frames and predict the camera parameters jointly, as opposed to simply averaging the parameters predicted from multiple individual images.

Additionally, since we observed that the first approach works much better for urban scenes because of the many straight lines they contain, one could also take advantage of this. In the case of AR using a projector, the projector could project some lines or a checkerboard into the scene, which could then be fed to the models to see how much the extra projected lines improve the prediction.

V. Conclusion

To conclude this review, one can confidently say that the deep learning approaches are not yet as accurate as the traditional ones. The authors themselves point out that the objective of these approaches is not to beat the traditional approaches in terms of accuracy. Rather, these approaches are much faster and only use a single image of an arbitrary scene, whereas the traditional approaches require multiple images of a calibration target of known geometry.

References

[ar-, 2019] (2019). CAE VimedixAR transforms medical education with Microsoft HoloLens. https://customers.microsoft.com/en-us/story/718933-cae-healthcare-hololens-en.

[Bogdan et al., 2018] Bogdan, O., Eckstein, V., Rameau, F., and Bazin, J.-C. (2018). DeepCalib: A deep learning approach for automatic intrinsic calibration of wide field-of-view cameras. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–10.

[Brunken and Gühmann, 2020] Brunken, H. and Gühmann, C. (2020). Deep learning self-calibration from planes. In Twelfth International Conference on Machine Vision (ICMV 2019), volume 11433, pages 980–990. SPIE.

[Eser, 2020] Eser, A. Y. (2020). OpenCV camera calibration. https://aliyasineser.medium.com/opencv-camera-calibration-e9a48bdd1844.

[Hold-Geoffroy et al., 2018] Hold-Geoffroy, Y., Sunkavalli, K., Eisenmann, J., Fisher, M., Gambaretto, E., Hadap, S., and Lalonde, J.-F. (2018). A perceptual measure for deep single image camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2354–2363.

[Mei and Rives, 2007] Mei, C. and Rives, P. (2007). Single view point omnidirectional camera calibration from planar grids. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 3945–3950. IEEE.

[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

[Zhang et al., 2014] Zhang, L., Fan, Q., Li, Y., Uchimura, Y., and Serikawa, S. (2014). An implementation of document image reconstruction system on a smart device using a 1D histogram calibration algorithm. Mathematical Problems in Engineering, 2014.
Eidesstattliche Erklärung

Hiermit versichere ich, Shrey Dixit, an Eides statt, dass ich die vorliegende Seminararbeit mit dem Titel Deep Learning for Camera Calibration: A Review, sowie die Präsentationsfolien zu dem dazugehörigen mündlichen Vortrag ohne fremde Hilfe angefertigt und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt habe. Alle Teile, die wörtlich oder sinngemäß einer Veröffentlichung entstammen, sind als solche kenntlich gemacht. Die Arbeit wurde in dieser oder ähnlicher Form noch nicht veröffentlicht, einer anderen Prüfungsbehörde vorgelegt oder als Studien- oder Prüfungsleistung eingereicht.

Declaration of an Oath

Hereby I, Shrey Dixit, declare that I have authored this thesis, titled Deep Learning for Camera Calibration: A Review, and the presentation slides for the associated oral presentation independently and unaided. Furthermore, I confirm that I have not used other than the declared sources / resources. I have explicitly marked all material which has been quoted either literally or by content from the used sources. This thesis, in same or similar form, has not been published, presented to an examination board or submitted as an exam or course achievement.

Hamburg, February 5, 2023

Shrey Dixit