Towards a multimodal stereo rig for ADAS: State of the art and algorithms

Fernando Barrera Campo (jfbarrera@cvc.uab.cat)
Centre de Visió per Computador, Universitat Autònoma de Barcelona (UAB)
08193 Bellaterra (Barcelona), Catalonia, Spain

Advisors: Felipe Lumbreras (felipe.lumbreras@cvc.uab.cat) and Angel D. Sappa (angel.sappa@cvc.uab.cat)

Abstract

The current work presents a survey of the state of the art on multimodal stereo rigs for ADAS applications. The survey is organized following a functional modular architecture. The basic requirements of each module have been identified, trying to adapt current stereo algorithms to the proposed multimodal framework. All the implemented algorithms have been validated using ground truth data (classical stereo data). Preliminary results on a multimodal stereo rig are presented, and the main challenges still open are identified.

Keywords: multimodal, stereo vision, infrared imagery, 3D reconstruction.

1. Introduction

In the most recent study published by the World Health Organization about road safety (WHO, 2009), it is concluded that "more than 1.2 million people die on the world's roads every year, and as many as 50 million others are injured". Most of these deaths occur in low-income and middle-income countries. In an attempt to remedy this situation, the WHO periodically issues reports on road traffic injury prevention; these are sets of recommendations on how countries could improve road safety and reduce the death toll on their roads. The WHO aims to help emerging and developing economies to overcome this situation. However, all these efforts have not been completely effective, and much work remains to be done. According to WHO projections for 2030, road traffic injuries will be the fifth leading cause of death in the world, at around 3.6%, rising from ninth position in 2004 with 2.2%. This means that the number of people injured on the roads may exceed the number of those sick with either stomach cancer or HIV/AIDS. For this reason, during the last few years the automotive industry and many research centers have been trying to develop Advanced Driver Assistance Systems (ADAS) that increase road safety.

The first step to accomplish this goal is to identify the risk factors that influence crash involvement; these were classified by WHO (2004) into 10 categories: speed; pedestrians and cyclists; young drivers and riders; alcohol; medicinal and recreational drugs; driver fatigue; hand-held mobile phones; inadequate visibility; road-related factors; and vehicle-related risk factors, which are generally out of the driver's control. These factors have different repercussions on road safety; note that the statistics vary according to the country, as well as to cultural, social, and economic aspects. These situations lead to many solutions and proposals in accordance with the problem to face. Atmospheric conditions such as precipitation, fog, rain, snow, and wind contribute to poor visibility and are therefore considered a cause of crash involvement. Statistics show that driving at night is riskier, in terms of crash involvement per distance traveled, than driving during the day. This is due to the more prevalent use of alcohol by drivers at night, the effect of fatigue on the driving task, and the risk associated with reduced visibility (Keall et al., 2005). In Europe, nearly 50% of fatal car accidents take place at night.
Thus, the risk of an accident is roughly twice as high for nocturnal driving as for driving during the day. Drivers may not see other road users: darkness, rain, or fog prevents them from reacting in time to avoid a collision, and even under high visibility road traffic incidents keep happening.

Recent advances in Infra-Red (IR) imaging technology have allowed its use in applications beyond the military and law enforcement domains. Additionally, new commercial IR vision systems have been included in diverse technical and scientific applications; for instance, they are used in airport, coastal, and park safety. This type of sensor offers features that facilitate tasks such as the detection of pedestrians, hot spots, and temperature differences, among others. This can significantly improve the performance of those systems where people are expected to play the principal role, for example video surveillance, monitoring, and pedestrian detection applications.

Nowadays, the literature on the recovery of 3D scene structure offers many models and techniques, whose selection depends on different factors. Gheissari and Bab-Hadiashar (2008) proposed a set of criteria that affect the suitability of a model to an application, for instance the nature of the physical constraints, performance, scale and distribution of noise, and complexity, among others. An undeniable relation is the one established between the model and the kind of sensor, because the sensor provides the initial information. In this sense, the automotive industry has selected Electro-Optical (EO) sensors as the preferred ones, mainly due to their low cost (EO sensors include night vision, infrared cameras, CCD/CMOS cameras, and proximity sensors; in general, it is a technology involving components, devices, and systems that operate by modification of the optical properties of a material by an electric field). Among these devices, IR cameras offer the best capabilities, so they will become the de facto standard for night-time imagery. This novel technology in the ADAS domain requires new methods, or the adaptation of previous ones, which represents both a challenge and an opportunity.

Here a classic dichotomy in 3D reconstruction arises: stereo versus monocular systems. During the last decade many valuable academic works were published in journals, such as Saxena et al. (2009), which considered the problem of estimating 3D structure from a single image by combining Markov Random Fields (MRF) and supervised learning theories. This enables monocular systems to drive autonomous vehicles through unstructured indoor/outdoor environments; it is a complex task that requires prior knowledge of the environment (Michels et al., 2005). A drawback is the way relative depths are estimated: while a well-calibrated stereo rig can directly compute depth from the image pair, a monocular system must detect and track features and then estimate depth from their motion. A similar discussion was formulated in (Sappa et al., 2008). In conclusion, each approach has its own advantages and disadvantages, but today stereo systems offer a good balance between accuracy, speed, and performance. At this point, it is possible to state the following question: could a couple of sensors measuring different bands of the electromagnetic spectrum, such as the visible and infrared ones, form part of an Advanced Driver Assistance System and help to reduce the number of injuries and deaths on the roads?
It is not an easy question, but the current work presents an approach toward a system with these characteristics; the main objective is to identify advantages, drawbacks, pitfalls, and potential opportunities. The fusion of data coming from different sensors, such as the emissions registered in the visible and infrared bands, represents a special problem because it has been shown that the signals are weakly correlated; indeed, there is a mild anti-correlation between the visible and Long-Wave Infrared (LWIR) bands (Scribner et al., 1999). Therefore, many traditional image processing techniques are not directly helpful and require adjustments to perform correctly in each modality.

Night vision systems started out as a military technology, and their usefulness was always related to defense and national security topics. This conception changed due to the good results reached in other knowledge areas, for instance ADAS. Nowadays, engineers have simplified IR devices for typical ADAS, so several vehicles now include IR systems and other Electro-Optical (EO) sensors directly from the manufacturer. Notice that this capacity has been installed, but it is not exploited beyond a simple thermal visualization of the path.

In order to address the above question, the current work studies the state of the art in stereo vision rigs composed of visible and infrared spectrum sensors, and proposes a framework to evaluate the best algorithms towards a multimodal stereo system. The document is organized as follows. We first introduce the state of the art and early research in multimodal stereo. Next, the main theoretical concepts about infrared technology, its development, and its applications are summarized, as well as the formulation and equations that model a stereo system. Finally, a multimodal stereo system and its constitutive stages are presented. The development of a multimodal stereo system is a complex task; for this reason, a traditional stereo system, based on color images, is analysed first. Next, this framework is extended to support another modality: thermal information. This incremental methodology was followed because the implemented functions and approaches need validation, so previously published metrics and protocols are reused to measure the error until a multimodal ground truth is developed.

2. State of the art

The recovery of three-dimensional scene structure has been an intense area of research for decades, but to our knowledge there are few works on depth estimation from infrared and visible spectrum imagery. This section reviews the latest advances in multimodal stereo and proposes a coarse-grained classification of the reviewed multimodal systems.

A review of the cited methods allows us to identify two types of data fusion: raw fusion, also referred to as early fusion, and high-level fusion (late fusion). These concepts are commonly used in other computer vision fields. Raw fusion directly uses the sensed measurements to evaluate a dissimilarity function. On the contrary, in the case of high-level fusion the data are not immediately processed; they are transformed to another representation with a higher level of abstraction. Hence, this scheme needs more pre-processing than raw fusion. Several implementations of multimodal stereo systems with raw fusion can be found, though they share a common drawback: so far, no suitable dis/similarity function has been proposed that matches IR emission, or thermal information, with intensity.
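One measure that has been applied in this cross-spectral setting (see the review below) is mutual information, which does not assume a linear relation between the two signals. The following is a minimal sketch of how a mutual information score between a visible and an infrared window could be computed; the function name and the histogram size are illustrative choices, not taken from the reviewed systems.

import numpy as np

def mutual_information(win_vs, win_ir, bins=32):
    # Mutual information between two image windows (e.g., visible vs. LWIR).
    # Both windows are flattened and their joint gray-level histogram is used
    # to estimate the joint and marginal probability distributions.
    joint, _, _ = np.histogram2d(win_vs.ravel(), win_ir.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability P(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)       # marginal P(y)
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Usage: slide win_ir over the search range and keep the displacement that
# maximizes mutual_information(win_vs, win_ir); unlike SAD/SSD, a high score
# does not require the two modalities to be linearly correlated.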
In order to face this problem, several methods bound the search space in the images. The most frequent way is to define regions of interest (ROIs); a sliding window of fixed size is moved over the region, a similarity function is evaluated, and its result is stored for future operations. The idea is to measure the quality of the matching over regions; for instance, hot spots in the infrared images are expected to correspond to pedestrians. If the segmentation of the images is stable in both modalities (infrared and visible spectrum), the pedestrian shapes are detected in both (they look alike), and the similarity value will be maximal over corresponding regions. This is a conceptual example; segmentation is not always necessary and can be replaced by a feature detector, such as a blob detector or any other keypoint detector. The only restriction is that the inputs of the stereo algorithm must be raw values. The strategies discussed above increase the matching precision, but prior knowledge of the scene also improves the overall performance. For instance, algorithms that search for certain shapes (heads, pedestrians, or vehicles) allow a voting scheme, where the regions with the highest vote count are considered the best matches (Krotosky and Trivedi, 2007a).

Multimodal fusion schemes with a high level of abstraction delay the data association until the data are pre-processed. In general, these approaches process each modality as if they were independent streams; the motivation is simple: the registration or correspondence stage on the raw sensed data is avoided by transforming the measurements to a common representation, for example optical flow vectors, shape descriptors, or regions.

In recent years significant progress has been made in multimodal stereo for ADAS applications, in spite of the difficulty of acquiring infrared devices (infrared cameras can only be bought after registration with the U.S. Department of Commerce, Bureau of Industry and Security). The papers based on two modalities (infrared and visible spectrum) are summarized in Table 1. Tasks such as registration and sensor fusion concentrate most of the research interest because they are required in every application; notice that these are the first challenges to solve. Table 1 lists the reviewed works on multimodal stereo; the most relevant and recent ones are discussed below, mainly those related to registration, because they are close to the aim of the current work.

Krotosky and Trivedi (2007a) report a multimodal stereo system for pedestrian detection consisting of three different stereo configurations: (i) two color cameras and two IR cameras; (ii) two color cameras and a single IR camera; and (iii) a color and an infrared camera pair. In the stereo rig, the color and infrared cameras are aligned in pitch, roll, and yaw.

(i) A configuration of two color and two infrared cameras is analyzed. In order to get dense disparity maps from both the color-stereo and infrared-stereo imagery, the correspondence algorithm by Konolige (1997) is used. This approach is based on area correlation: a Laplacian of Gaussian (LoG) filter transforms the gray levels in the image into directed edge intensities over a smoothed region, and the degree of similarity between the filtered images is measured with the Absolute Difference (AD) function.

Table 1: Review of multimodal stereo IR/VS rig applications.
- Registration: geometrically align images of the same scene taken by different sensors (modalities). References: (Kim et al., 2008), (Morin et al., 2008), (Zheng and Laganiere, 2007), (Istenic et al., 2007), (Zhang and Cui, 2006), (Krotosky and Trivedi, 2006), (Hild and Umeda, 2005), (Segvic, 2005), (Kyoung et al., 2005), and (Li and Zhou, 1995).
- Detection: detection of human shapes in visible spectrum and infrared imagery. References: (Krotosky and Trivedi, 2007a), (Han and Bhanu, 2007), and (Bertozzi et al., 2005).
- Tracking: detection and location of moving people (pedestrians, hot spots, or other body parts) over time using multiple modalities. References: (Trivedi et al., 2004), (Krotosky et al., 2004), (Krotosky and Trivedi, 2008), and (Krotosky and Trivedi, 2007b).
- 3D extraction: thermal mapping. Reference: (Prakash et al., 2006).
In order to optimize the results, the disparity range is quantized into 64 levels, and areas with low texture are detected with an interest operator (Moravec, 1979) and rejected. Finally, unfeasible matches are detected and suppressed using the left-right consistency constraint. The above disparity computation algorithm is applied to both the color and the infrared pair. The authors' aim is to get a first approximation to the problem, which is why a well-known real-time algorithm is used.

(ii) A multimodal stereo setup is presented, composed of a color-stereo rig paired with a single infrared camera. A dense stereo matching algorithm for corresponding the color images is considered; for this purpose, the method explained above (Konolige, 1997) is used. Next, a trifocal framework is used to register the color and infrared images with the disparity map.

(iii) The last multimodal stereo approach to pedestrian detection is a system with raw fusion, in opposition to the ones introduced above. The authors use a previous implementation based on mutual information (Krotosky and Trivedi, 2007b). They propose to match regions in cross-spectral stereo images. The matching is done by fixing a correspondence window in one reference image of the pair and sliding the window along the second image; the quality of the match between the two correspondence windows is obtained by measuring the mutual information between them. The weak correlation between the images implies a high bad-matching rate, so an additional restriction must be included: the pedestrian's shape is assumed to belong to a specific plane. Hence, it is possible to define a voting matrix in which a poor match is widely distributed across a large number of different disparity values, whereas the disparity variation is small for a good match.

Prakash et al. (2006) introduce a method that recovers a 3D temperature map from a pair of thermally calibrated cameras. Firstly, the cameras are calibrated with a Matlab toolbox (Bouguet, 2000), so intrinsic and extrinsic parameters are computed. The next task is to solve the correspondence problem; a surface with isotherm lines (lines of equal temperature) is assumed in order to match points. The main drawback of this approach is that it assumes varying temperature gradients on the object surface; otherwise the similarity measure is not distinctive enough to detect correct matches. The cost function used is the sum of absolute differences (SAD).

3. Background

The study of the nature of light has always been a major source of progress in physics, as well as in computer vision.
This section introduces the basic concepts needed to understand infrared sensors, their evolution and physical properties, and their usefulness for ADAS. It also presents some ways of classifying stereo systems, their components, the geometric models derived from camera placement, and depth computation.

3.1 Infrared sensors: history, theory and evolution

In 1800 the astronomer Sir Frederick William Herschel experimented with a new form of electromagnetic radiation, which was later called infrared radiation. He built a crude monochromator and measured the distribution of energy in sunlight. Herschel's experiment additionally showed the existence of a relation between temperature and color. In order to show this, he let sunlight pass through a glass prism, where it was dispersed into its constituent spectrum of colors. Next, an array of thermometers with blackened bulbs was used to measure the temperature of the various colors.

The equipment used by Herschel has been improved many times thanks to new alloys and materials. Nowadays there are two general classes of detectors: photon (or quantum) detectors and thermal detectors. In the first class, the radiation is absorbed within the material by interaction with electrons. The observed electrical output signal results from the changed electronic energy distribution; these electrical property variations are measured to determine the amount of incident optical power. Depending on the nature of the interaction, photon detectors are divided into different types, the most important being intrinsic detectors, extrinsic detectors, photoemissive detectors, and quantum-well detectors. They have high performance but require cryogenic cooling. Therefore, IR systems based on semiconductor photodetectors are heavy, expensive, and inconvenient for many applications, especially ADAS.

In a thermal detector, the incident radiation is absorbed by a semiconductor material, causing a change in the temperature of the material, or in another physical property, and the resulting change is used to generate an electrical output proportional to the incident radiation. In this kind of sensor it is necessary that at least one inherent electrical property changes with temperature and can be measured. A traditional low-cost device used as a thermal detector is the bolometer, which turns an incoming photon flux into heat, changing the electrical resistance of the detector element, whereas in a pyroelectric detector, for example, this flux changes the internal spontaneous polarization. Currently, thermal detectors are available for commercial applications, as opposed to photon-based ones, which are restricted to military uses. In contrast to photon detectors, thermal detectors do not require cooling. In spite of this, photon detectors are considered faster and more wavelength-selective than other types of detectors, a fact that was exploited by the military industry. Later, in the '90s, advances in micro-miniaturization allowed arrays of bolometers or thermal detectors, which compensated for the moderate sensitivity and low frame rate of thermal detectors. Large arrays enable high-quality imagery and good response times, and they also made the manufacturing cost drop quickly (an extensive review is given in Rogalski (2002)).
By definition, infrared refers to the part of the electromagnetic spectrum between the visible and microwave regions, and its behavior is modeled by the following equations:

\nu = \frac{c}{\lambda}, \qquad (1)

where c is the speed of light, approximately 3 \times 10^{8} m/s; \nu is the frequency (Hz); and \lambda is the wavelength (m). The energy is related to wavelength and frequency by the following equation:

E = h\nu = \frac{hc}{\lambda}, \qquad (2)

where h is Planck's constant, equal to 6.6 \times 10^{-34} J s. Notice that light and electromagnetic waves of any frequency will heat surfaces that absorb them; infrared detectors measure the emission in this band, but emission also occurs in other bands, depending on the physical properties of the objects (their constitutive material). Humans at normal body temperature mainly radiate at wavelengths around 10 \mu m, which corresponds to the Long-Wave InfraRed (LWIR) band (see Table 2).

Table 2: General spectral bands based on atmospheric transmission and sensor technology.

Spectral band                  Wavelength (\mu m)
Visible                        0.4 - 0.7
Near InfraRed (NIR)            0.78 - 1.0
Short-Wave InfraRed (SWIR)     1 - 3
Mid-Wave InfraRed (MWIR)       3 - 5
Long-Wave InfraRed (LWIR)      8 - 12

The use of night vision devices should not be confused with thermal imaging. Night vision devices convert ambient light photons into electrons, which are then amplified by a chemical and electrical process and converted back into visible light, whereas thermal sensors create images by detecting the radiation emitted by objects.

3.2 Night vision in ADAS

Night vision is a technology that originated in military applications for producing a clear image on the darkest of nights. As explained above, thermal sensors need no light whatsoever to operate, and they also have the ability to see through special conditions such as fog, rain, haze, or smoke. Thus, they are interesting for ADAS, since road users could avoid potential hazards. Researchers have always been convinced that thermal imaging is an extremely useful technology, and nowadays vehicles with IR equipment can be found. This tendency is being followed by different car manufacturers, and new technical requirements have been formulated. Today there are two different technologies on the market: one is called active, using near-infrared laser sources and detectors, and the other passive, which only uses a thermal infrared detector (Ahmed et al., 1994). The difference is notable. Active systems beam infrared radiation into the area in front of the vehicle, usually with laser sources or simply a light bulb in the near-infrared (NIR) range. The infrared radiation is reflected by objects, the road, humans, and other road users, and the reflections are captured with a camera sensitive to the same region of the spectrum that was emitted, for example a NIR camera. Passive systems, on the other hand, register relative differences in heat, that is, the infrared radiation emitted in the far-infrared (FIR) band, and they do not need a separate light source.

The selection of the best night vision system for ADAS is not easy, and different factors must be considered. Although both systems are technically and economically feasible, passive systems based on FIR offer several advantages. They do not depend on the power of infrared beams, because no beams are necessary; they contain fewer components, so they are less susceptible to breakdowns; and FIR detects people and hot spots at a longer range.
The major advantage of FIR is that it is not sensitive to the headlights of oncoming traffic, street lights, or strongly reflecting surfaces such as traffic signs. Since NIR (active) systems are based on light beams with wavelengths close to the visible spectrum, two problems can occur. Firstly, the driver can be blinded by reflected or scattered light; secondly, if an object is illuminated by two or more infrared beams, it can appear excessively bright on the screen. The worst case is when an infrared source directly illuminates the detector, a frequent situation caused by the glare of oncoming cars (FLIR, 2008; Schreiner, 1999).

The setup of IR systems in the context of ADAS is another interesting topic; it includes camera position, display, and applications. These are discussed in more detail next.

3.2.1 Camera position

The location of the sensor or camera is critical to obtain an acceptable image of the road. If the camera is positioned low (e.g., in the grill), the perspective of the road will be less than ideal, especially when driving on vertical curves. It is acceptable to position the sensor at the driver's eye height, and preferable to place it above the driver's eyes. Another aspect of camera position is that a lower position is more exposed to dirt. Glass interferes with FIR wavelengths and cannot be placed in front of the sensor; thus, a FIR sensor cannot be placed behind the windshield. However, early research, such as that performed by BMW, concluded that the best position is at the left of the front bumper. This result could seem contradictory, but a new generation of FIR sensors is being developed especially for ADAS. Table 3 presents the key points of the systems deployed by five car manufacturers. Other examples are the Renault NIR contact-analogue system, which is placed at the inside rear-view mirror, and the Daimler-Chrysler NIR camera, which is placed high above the driver's eyes (rear mirror).

Table 3: IR systems.
- General Motors and Volvo: FIR camera mounted behind the front grill and covered by a protective window. Raytheon IR camera; maximum sensitivity at 35 °C; field of view 11.25° horizontally and 4° vertically; pedestrian detection range 300 m.
- Fiat and Jaguar: NIR camera placed just above the driver's head (rear mirror), with the light source over the bar at the front of the car. Active NIR system; field of view 45° horizontally; pedestrian detection range 150 m.
- Autoliv: FIR camera placed at the lower end of the windshield. Active NIR system; field of view 45° horizontally; pedestrian detection range 150 m.

3.2.2 Display and applications

Initially, the feasibility was explored of night vision systems using a mirror and a projector over the dashboard and the lower part of the driver's windshield; this unit projects real-time thermal images, which appear to float above the hood and below the driver's line of sight. This visualization may be good, but including such devices in vehicles demands the development of expensive technologies that users will not pay for. A more realistic system, currently used in many vehicles, consists of a liquid crystal display (LCD) embedded in the middle of the dashboard, where the driver checks the thermal images and other applications supplied by the vehicle computer.
The current commercial applications based on infrared images are limited to displaying a stream of images corresponding to events registered in real time by the sensors. Although developing a real-time system is not a simple task, the only image processing operation applied is contrast enhancement. Recently, new research lines in night vision have been developing software that can identify pedestrians or critical situations.

3.3 Stereo vision

Computational stereo refers to the problem of determining the three-dimensional structure of a scene from two or more images taken from distinct viewpoints. It is a well-known technique to obtain depth information by optical triangulation; other examples are stereoscopy, active triangulation, depth from focus, and confocal microscopy. Stereo algorithms can be classified according to different criteria. A taxonomy for stereo matching is presented by Scharstein et al. (2001); they propose to categorize algorithms into two groups, which are briefly explained next. Local methods attempt to match a pixel with its corresponding one in the other image. These algorithms find similarities between connected pixels through their neighborhood; the surrounding pixels provide the information to identify matches. Local methods are sensitive to noise and ambiguities, such as occluded regions, regions with uniform texture, repeated patterns, and changes of viewpoint or illumination. Global methods can be less sensitive to the mentioned problems, since high-level descriptors provide additional information for ambiguous regions. These methods formulate the matching problem in mathematical terms, more so than local methods, which allows introducing restrictions that model the surfaces or disparity maps, for instance smoothness and continuity, among others. Finding the best conditions, restrictions, or primitives to decrease the percentage of badly matched pixels is still an open research topic; some methods use heuristic rules or functionals to do it. Their main advantage is that scattered disparity maps can be completed. This is performed by techniques such as dynamic programming, intrinsic curves, graph cuts, nonlinear diffusion, belief propagation, deformable models, or any other optimization or search procedure (i.e., choosing the best element from some set of available alternatives).

Existing algorithms are also categorized into different groups depending on the number of input images: multiple images or a single image. In the first case, the images can be taken either by multiple sensors with different viewpoints or by a single moving camera (or by moving the scene while holding the sensor fixed). Another classification can be obtained according to the number of sensors used: monocular, bifocal, trifocal, and multi-ocular. Figure 1 shows a generic binocular system with nonverged geometry (camera principal axes are parallel). The fundamental basis for stereo is the fact that every point in three-dimensional space is projected to a unique location in the images (see Figure 1). Therefore, if it is possible to correspond the projections of a scene point in the two images (I_L and I_R), then its spatial location in a world coordinate system O can be recovered.

Figure 1: A stereo camera setup.

Assume that P_L and P_R are the projections of the 3D point P on the left and right images, and that O_L and O_R are the optical centers of the cameras, on which two reference coordinate systems are centered (see Figure 2).
If, in addition, a pinhole model is assumed for the cameras, and the image plane arrays are made up of perfectly aligned rectangular grids, then the line segment C_L C_R is parallel to the x coordinate axis of both cameras. Under this particular configuration, the point P is defined by the intersection of the rays from the optical centers O_L and O_R through their respective image points P_L and P_R.

Figure 2: The geometry of nonverged stereo.

The depth Z is obtained from the similarity between the triangles \triangle O_L C_L P_L and \triangle P K O_L, and between \triangle O_R C_R P_R and \triangle P K O_R:

\frac{C_L P_L}{C_L O_L} = \frac{O_L K}{K P} \quad \text{(by similar triangles } \triangle O_L C_L P_L \text{ and } \triangle P K O_L\text{)}, \qquad (3)

\frac{C_R P_R}{C_R O_R} = \frac{O_R K}{K P} \quad \text{(by similar triangles } \triangle O_R C_R P_R \text{ and } \triangle P K O_R\text{)}, \qquad (4)

\therefore \quad K P = C_L O_L \, \frac{O_L K + O_R K}{C_L P_L + C_R P_R} \quad \text{from (3) and (4), with } C_L O_L = C_R O_R, \qquad (5)

\text{or} \quad Z = f\,\frac{T}{d}, \qquad (6)

where d is the disparity, i.e., the displacement of a projected point in one image with respect to the other; in the nonverged geometry depicted in Figure 2 it is the difference between the x coordinates, d = x - x' (valid when the pixels x and x' are indices of a matrix). The baseline T is defined as the line segment joining the optical centers O_R and O_L.

4. Problem formulation

As mentioned in previous sections, a large number of accidents take place at night, and IR sensors could help to improve safety in those situations, as well as during extreme weather conditions. Therefore, the research on new systems that exploit these novel technologies, which are available on the market and could help road users avoid potential hazards, is an open issue. Along this line, a framework for stereo correspondence that matches images of different modalities, thermal and color imagery, is proposed. As a first step, two images of the same modality (color/color) are matched, because the functions and methods require debugging; afterwards, the framework is extended to infrared images. From the different stereo algorithms briefly reviewed in the previous section, it can be concluded that dense stereo algorithms are the best choice, due to their fast response, easy hardware implementation, and suitability for ADAS. The disparity maps they supply are useful in several applications, such as navigation, detection, tracking, and others.

In this section, the different modules that define the proposed multimodal framework are introduced. Figure 3 shows a flow chart of the modules, which are described in detail in the next sections.

• Image acquisition: the cameras were arranged as shown in Figure 4, simulating their real position in a vehicle, and a dataset composed of several color and infrared images was generated.

• Calibration: the next step is camera calibration, which computes the intrinsic and extrinsic parameters. This is an important stage, since the stereo algorithm is based on the epipolar constraint; the stability of the system depends on the estimation of the epipolar geometry, because an error in this estimation would prevent finding the right correspondences.

• Rectification: a transformation process used to project multiple images onto a common world reference system. It is used by the framework to simplify the problem of finding matching points between images.
• Stereo algorithm: consists of several steps to find the depth from a stereo pair of images; initially in the visible spectrum, and later for a multimodal pair (infrared and color).

• Triangulation: the 3D position [X, Y, Z]^T of a point P can be recovered from its perspective projections on the image planes of the cameras, once the relative position and orientation of the two cameras are known and the projections are corresponded.

Two approaches were studied to deal with the image sweep. The image sweep is the strategy used to slide the correspondence window; this can be done in two ways: scanning over epipolar lines, or scanning horizontally after rectification. In most camera configurations, finding correspondences requires a search in two dimensions. However, if the two cameras are aligned to have a common image plane, which is referred to in the literature as rectification, the search is simplified to one dimension. The rectification stage builds a virtual system of aligned cameras, independently of whether their baseline is parallel to the horizontal axis of the images or not. Notice that working with rectified images is computationally efficient, because the sliding window is moved in one direction (the row direction). Unfortunately, our stereo rig has the cameras far apart (wide baseline), and the rectification stage, because of this distance, changes the pixel values, which makes the matching more complex. The cases where each approach is valid will be shown.

The core of the framework is the stereo algorithm, which is composed of four stages: matching cost computation, cost aggregation, disparity computation/optimization, and disparity refinement. The goal of these steps is to produce a univalued function in disparity space, d(x, y), that best describes the shape of the surfaces in the scene. This can be viewed as finding a surface embedded in the disparity space image that has some optimal property, such as the lowest cost. Finally, triangulation is the process that determines the location of a point in Euclidean space, given its projections and the intrinsic and extrinsic camera parameters.

Figure 3: Multimodal framework.

5. Image acquisition

The image acquisition system is shown in Figure 4. It consists of four cameras: a pair of cameras corresponding to an off-the-shelf stereo vision system (Bumblebee, from Point Grey), a conventional black and white (b/w) camera (VS1), and an IR camera (IR1, a Photon 160 from FLIR).

Figure 4: Camera setup.

The Bumblebee camera is used as a validation tool, providing the ground truth of the scene. It is related to the multimodal stereo rig by a rotation R_BB and a translation T_BB; these parameters are unknown at acquisition time and need to be computed. The main multimodal system is formed by the b/w and infrared cameras (VS1 and IR1). This is not the only possible configuration: by permuting the cameras it is possible to build other multimodal stereo systems, for instance VS2 with IR1, or VS3 with IR1. The main specifications of the cameras used in the current work are summarized in Table 4.

Table 4: Camera specifications.
Specification       IR1        VS1        VS2        VS3
Image sensor type   Thermal    CCD        CCD        CCD
Resolution          164×129    752×480    640×480    640×480
Focal length        18 mm      6 mm       6 mm       6 mm
Pixel size          51 μm      7.4 μm     7.4 μm     7.4 μm
Image properties    14 bits    8 bits     RGB        RGB

5.1 Experimental Results

The next sections present in detail both the image acquisition system and the recorded datasets.

5.1.1 Image acquisition system and hardware

The acquisition system was assembled with the cameras detailed in Table 4, simulating their real position on a vehicle. A schema like the one shown in Figure 4 was followed; its hardware implementation is depicted in Figure 5(a). Since each camera delivers its own video stream and the streams were recorded on different PCs, a synchronization mark was used in the videos. Furthermore, note that the cameras record at different frame rates; the VS1 camera has a frame rate almost two times faster than that of the infrared camera (IR1). The mark was made with a blade and a pair of screws on its border; this helix was attached to a drill, like a fan. The four streams are then synchronized using the position of the blade. In the infrared images the screws appear as hot spots, and in the color images as shiny points, due to the thermal differences and the reflected light (see Figure 5(b)).

Figure 5: Image acquisition system and hardware. (a) Stereo rig. (b) Stream synchronizer.

5.1.2 Image dataset

The image acquisition system presented above has been used for recording 4 sequences (2 indoor and 2 outdoor). These video sequences were used for validating the different algorithms developed throughout the current work. Furthermore, the dataset of Scharstein and Szeliski (2003), which consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data, was considered. In particular, the dataset called Teddy was used for validating the disparity maps generated by the different stereo algorithms implemented (Section 8). A couple of example images are shown in Figures 6 and 8.

Figure 6: Multimodal dataset. (a) VS1 camera. (b) IR camera.

The Bumblebee is a device specially designed for the computational modelling of 3-D scenes from two images (Figures 7(a) and 7(b)). In Figure 4 it is represented by the VS2 and VS3 cameras, which are mechanically coupled. The Bumblebee is offered together with a framework coded in Visual C++, which allows its modification for general purposes. Initially, the 3-D model of the scene (Figures 7(c) and 7(d)) was planned to be used as ground truth for validating the results of the proposed algorithm. However, it is a better option to measure the error at a previous stage, and thus avoid the additive effect of noise. Therefore, the validation is performed prior to the triangulation, over the disparity map, which is an intermediate representation. Nevertheless, the model generated by our algorithm is shown for visual inspection.

6. Calibration

Camera calibration is a necessary step in 3D computer vision in order to extract metric information from 2D images. Many works have been published by the photogrammetry and computer vision communities, but none formally addresses the calibration of multimodal systems. The multimodal stereo rig in the current work is built with an IR and a color camera. Infrared calibration is tackled by two different procedures. The first is thermal calibration, which ensures that the sensed temperatures are equal to the real ones; several approaches can be found, but generally the camera provider supplies the software and calibration targets.
Thermal calibration is mainly useful in applications where temperature is a process variable, for example thermal analysis of materials, food inspection, heat transfer, or simulation, among others. The second procedure is camera calibration in the computer vision sense, stated as follows: given one or more images, estimate the intrinsic parameters, the extrinsic parameters, or both.

Figure 7: Ground truth (Bumblebee). (a) VS2 camera. (b) VS3 camera. (c)-(d) Corridor 3-D model.

Figure 8: Ground truth (Teddy). (a) Image 2. (b) Image 6. (c) Disparity of image 2. (d) Disparity of image 6.

It is interesting to mention that previous research avoids the camera calibration problem (intrinsic/extrinsic) by applying constraints on the scene, for instance by assuming perfectly aligned cameras or strategic camera setups that cover the scene (Krotosky and Trivedi, 2007b), (Zheng and Laganiere, 2007), among others. These restrictions on the camera configuration are not considered in the current work because they are not valid in a real environment (an on-board vision system, see Section 3.2.1); therefore, this work does not rely on such restrictions or assumptions.

The multimodal image calibration is summarized below. Firstly, the calibration software of Bouguet (2000) has been used; this toolbox assumes input images from each modality in which a calibration board is visible in the scene. In a typical visual setup, this is simply a matter of placing a checkerboard pattern in front of the camera. However, due to the large differences between visual and thermal imagery, some extra care needs to be taken to ensure that the calibration board looks similar in each modality. A simple solution is to use a standard board and illuminate the scene with high-intensity halogen bulbs placed behind the cameras. This effectively warms the checkerboard pattern, making the visually dark checkers appear brighter in the thermal imagery. Placing the board under constant illumination reduces the blurring associated with thermal diffusion and keeps the checkerboard edges sharp, allowing the use of any calibration toolbox freely distributed on the Internet.

The previous procedure is well known in the computer vision community; it uses available calibration tools, and good results have been reported with them. However, our experience showed that some variations are necessary to reach optimal results, especially when the application depends on the calibration results, as in 3-D reconstruction. Furthermore, IR cameras need special attention, because their intrinsic parameters are slightly different from those of conventional cameras; for example, IR cameras have small focal lengths, low resolution, and a reduced field of view (FOV).

6.1 Experimental Results

In order to compute the intrinsic and extrinsic camera parameters of the proposed multimodal stereo rig, the following procedures were used.

6.1.1 Intrinsic parameters

The procedure presented below is the standard method for calibrating a camera; it must be performed three times. The aim is to get intrinsic parameters as accurate as possible by extracting images from the initial sequence or video (see Procedure 1).

Procedure 1: Visible spectrum calibration.
1. Build a subset S_VS1, S_VS2, S_VS3 of images for each sequence, where the calibration pattern is clearly defined.
2. Follow the standard calibration procedure (Bouguet, 2000).
3. Save the intrinsic camera parameters.

For the calibration of the IR camera, a few changes were introduced to the calibration toolbox.
Firstly, the graphic interface for point selection was changed; it now shows two images: the detected edges and the original infrared image. The user can change the color map of the infrared image to enhance its contrast and thus easily identify the board against the background (see Figure 9(a)). The temperature variation of objects that do not emit radiation is weak; therefore, the magnitude of the gradient is small and edge detection is difficult. For this reason, a tool was developed in which the user manually selects the edges; it also offers the possibility of undoing and clearing until the edges are correctly marked. The user selects two points and the calibration tool joins them with a line. When the four edges are marked (see Figure 9(a)), the toolbox computes the intersections between the lines, and the obtained intersection points are used for calibration. Initially, the nails that join the calibration pattern to the board (see Figure 9(c)) were used; however, this approach does not work properly, since nail detection was not stable along the sequence. Finally, the calibration is performed with the four intersection points, following Procedure 2.

Procedure 2: Infrared spectrum calibration.
1. Build a subset S_IR1 of images where the edges of the calibration board are distinguishable from the background.
2. Manually select 2 points for each edge, 8 points in total describing 4 lines; the extended functions compute the intersection points and use them for the calibration (see Figure 9(b)).
3. Follow the standard calibration procedure (Bouguet, 2000).
4. Save the intrinsic camera parameters.

6.1.2 Extrinsic parameters

Procedure 3, which computes the extrinsic parameters of the multimodal stereo rig, is similar to the calibration of a general stereo rig. The Bouguet calibration toolbox has functions for calibrating a stereo system, which have been used. These functions return two results for the rotation (R) and translation (T) values: (i) with optimization and (ii) without optimization. The best values for R and T were obtained without optimization, using the intrinsic parameters computed in Procedures 1 and 2.

Procedure 3: Infrared and visible spectrum calibration.
1. Load into the calibration toolbox the intrinsic parameters previously computed with Procedures 1 and 2.
2. Build a subset Sm such that Sm_VS1 ⊆ S_VS1 and Sm_IR1 ⊆ S_IR1, with Sm_VS1 and Sm_IR1 corresponding to the same frames.
3. Apply step 2 of Procedure 2 on each subset (Sm_VS1 and Sm_IR1), using the extension of the toolbox.
4. Compute the rotation and translation, combining the intrinsic parameters loaded in step 1 with the points obtained in the previous step.

Figure 9: Infrared spectrum calibration. (a) IR points. (b) IR edge selection. (c) Initial approach.

Tables 5 and 6 show the intrinsic and extrinsic parameters of the multimodal stereo rig shown in Figure 4.

Table 5: Camera calibration - Intrinsic parameters.

Camera  Focal length  Principal point    Skew  Distortion coefficients
IR1     244.23        [81.50, 64.00]     0     [-0.40, 7.96, 0.09, -0.01, 0]
VS1     954.94        [375.50, 239.50]   0     [-0.33, -0.02, 0, 0, 0]
VS2     836.80        [306.63, 240]      0     [-0.36, 0.21, 0, 0, 0]
VS3     835.88        [322, 225.80]      0     [-0.34, 0.20, 0, 0, 0]

Table 6: Camera calibration - Extrinsic parameters.

Multimodal rig:
  R(VS1 -> IR1) = [ 0.996 -0.007 -0.080;  0.018 0.990 0.133;  0.078 -0.134 0.987 ]
  T(VS1 -> IR1) = [ 0.115, 0.005, -0.103 ]
  R(VS1 <- IR1) = [ 0.997 0.016 0.065;  -0.007 0.991 -0.129;  -0.066 0.128 0.989 ]
  T(VS1 <- IR1) = [ -0.122, 0.006, 0.089 ]

Bumblebee rig:
  R(VS2 -> VS3) = [ 0.999 0.004 0.000;  -0.004 0.999 -0.006;  -0.000 0.006 0.999 ]
  T(VS2 -> VS3) = [ -0.119, 0.000, 0.002 ]
  R(VS2 <- VS3) = [ 0.999 -0.004 0.000;  0.004 0.999 0.006;  0.000 -0.006 0.999 ]
  T(VS2 <- VS3) = [ 0.119, -0.000, -0.002 ]

7. Rectification

Given a pair of stereo images, the rectification procedure determines a transformation of each image plane such that pairs of conjugate epipolar lines become collinear and parallel to one of the image axes (usually the horizontal one). The rectified images can be interpreted as those of a new stereo rig, obtained by rotating the original cameras.
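For orientation only, the same operation can be sketched with off-the-shelf routines; the snippet below is a hedged example assuming an OpenCV-style API and calibration outputs such as those of Section 6 (intrinsic matrices, distortion vectors, and the relative R and T). It is not the implementation used in this work, which instead extends Fusiello et al. (2000) as described in the remainder of this section; note also that it glosses over the different resolutions of the IR and visible cameras by assuming a common working image size.

import cv2

# Assumed inputs: K_vs, K_ir (3x3 intrinsics), d_vs, d_ir (distortion vectors),
# R, T (rotation and translation between the two cameras), size = (width, height).
def rectify_pair(img_vs, img_ir, K_vs, d_vs, K_ir, d_ir, R, T, size):
    # Rectifying rotations (R1, R2) and new projection matrices (P1, P2).
    R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
        K_vs, d_vs, K_ir, d_ir, size, R, T, alpha=0)
    # Per-camera remapping tables, then warp both images.
    map1x, map1y = cv2.initUndistortRectifyMap(K_vs, d_vs, R1, P1, size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K_ir, d_ir, R2, P2, size, cv2.CV_32FC1)
    rect_vs = cv2.remap(img_vs, map1x, map1y, cv2.INTER_LINEAR)
    rect_ir = cv2.remap(img_ir, map2x, map2y, cv2.INTER_LINEAR)
    return rect_vs, rect_ir, Q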
It is assumed that the stereo rig is calibrated; in other words, the intrinsic parameters of each camera and the extrinsic parameters of the acquisition system (R and T) are known. Rectification basically converts a general stereo configuration into a simple one. Its main advantage is that it simplifies the computation of the stereo correspondence, since the search is done along horizontal lines in the rectified images; moreover, dense stereo correspondence algorithms frequently assume a simple stereo configuration such as the one shown in Figure 2. In this section, an extension of the method of Fusiello et al. (2000) is proposed and implemented to deal with the multimodal stereo system.

7.1 Camera model

The camera model of Section 3.3 describes the geometry of two views. Now, a projective camera model is assumed for both the infrared and the visible spectrum cameras (see Figure 10). A projective camera is modelled by its optical center C and its retinal plane, or image plane, \Pi, at Z = f. A 3D point X is projected into an image point x given by the intersection of \Pi with the line containing C and X. The line containing C and orthogonal to \Pi is called the optical axis, and its intersection with \Pi is the principal point p. The distance between C and p is the focal length (Hartley and Zisserman, 2004).

Figure 10: Camera model.

Let X = [X\ Y\ Z]^T be the coordinates of X in the world reference frame and x = [x\ y]^T the projection of X onto the image plane (in pixels). If the world and image points are represented by homogeneous vectors, then central projection is simply expressed as a linear mapping between their homogeneous coordinates. Let \tilde{X} = [X\ Y\ Z\ 1]^T and \tilde{x} = [x\ y\ 1]^T be the homogeneous coordinates of X and x respectively; then:

\tilde{x} = P\,\tilde{X}. \qquad (7)

The camera is therefore modeled by its perspective projection matrix P, which can be decomposed, using the QR factorization, into the product:

P = K[R \mid t]. \qquad (8)

The matrix K, or camera calibration matrix, depends on the intrinsic parameters only, and has the following form:

K = \begin{bmatrix} \alpha_x & \gamma & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (9)

where \alpha_x = f m_x and \alpha_y = f m_y represent the focal length of the camera in terms of pixel dimensions in the x and y directions respectively; f is the focal length; m_x and m_y are the effective number of pixels per millimetre along the x and y axes; (x_0, y_0) are the coordinates of the principal point; finally, \gamma is the skew factor.

In general, points in space are expressed in terms of a different Euclidean coordinate frame, known as the world coordinate frame. The two coordinate frames are related via a rotation and a translation. If X is an inhomogeneous 3-vector representing the coordinates of a point in the world coordinate frame, and X_cam represents the same point in the camera coordinate frame, then we may write X_cam = R(X - \tilde{C}), where \tilde{C} represents the coordinates of the camera centre in the world coordinate frame, and R is a 3 × 3 rotation matrix representing the orientation of the camera coordinate frame.
This equation may be written in homogeneous coordinates as:

X_{cam} = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{bmatrix} R & -R\tilde{C} \\ 0 & 1 \end{bmatrix} \tilde{X}. \qquad (10)

Putting (10) together with (8) and (7) leads to the following equation:

x = K R [I \mid -\tilde{C}]\,X, \qquad (11)

where x is now expressed with respect to a world coordinate frame. It is often convenient not to make the camera centre explicit, and instead to represent the world-to-image transformation as X_{cam} = R\tilde{X} + t. In this case the camera matrix is simply

x = K[R \mid t]\,X, \quad \text{where } t = -R\tilde{C}. \qquad (12)

7.2 Rectification of camera matrices

It is assumed that the stereo rig is calibrated and that the perspective projection matrices P_1 and P_2 are known. The rectification process defines two new projection matrices P_1^n and P_2^n, obtained by rotating the old ones around their optical centers until the focal planes become coplanar, thereby containing the baseline. This ensures that the epipoles are at infinity and, hence, that the epipolar lines are parallel. To have horizontal epipolar lines, the baseline must be parallel to the new X axis of both cameras. In addition, to have a proper rectification, corresponding points must have the same vertical coordinate.

The calibration procedure explained in Section 6 returns the values K_1, R_1, t_1, K_2, R_2, t_2 for the two cameras. Notice that, depending on where the world reference is placed, either t_1 or t_2 is equal to [0\ 0\ 0]^T; the same happens for the rotations, one of which is equal to the identity matrix. In general, equation (8) can be written as follows, since a general projective camera may be decomposed into blocks:

P_m = \begin{bmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{21} & q_{22} & q_{23} & q_{24} \\ q_{31} & q_{32} & q_{33} & q_{34} \end{bmatrix} = \begin{bmatrix} q_1^T & q_{14} \\ q_2^T & q_{24} \\ q_3^T & q_{34} \end{bmatrix} = [M^m \mid p_4^m], \qquad (13)

where M^m is a 3 × 3 matrix and p_4^m a column vector of camera m. The coordinates of the optical centres c_1 and c_2 are given by:

c_1 = -R_1^{-1} K_1^{-1} p_4^1, \qquad (14)

c_2 = -R_2^{-1} K_2^{-1} p_4^2. \qquad (15)

In order to rectify the images it is necessary to compute the new principal axes X^n, Y^n, and Z^n. A special constraint is applied for computing X^n, since this axis must be parallel to the baseline in order to have horizontal epipolar lines; then:

\hat{x}^n = \frac{c_2 - c_1}{\lVert c_2 - c_1 \rVert}. \qquad (16)

The new Y^n axis is orthogonal to X^n and to the old Z axis:

\hat{y}^n = \frac{\hat{z} \times \hat{x}^n}{\lVert \hat{z} \times \hat{x}^n \rVert}, \qquad (17)

where the old Z axis is the third row vector (r_3^T) of the rotation matrix R_1 (camera 1):

R_1 = \begin{bmatrix} r_1^T \\ r_2^T \\ r_3^T \end{bmatrix}. \qquad (18)

The new Z^n axis is orthogonal to X^n and Y^n:

\hat{z}^n = \frac{\hat{x}^n \times \hat{y}^n}{\lVert \hat{x}^n \times \hat{y}^n \rVert}. \qquad (19)

The previous procedure shows the steps for computing the new axes, but no image has been rectified yet. In order to perform the image rectification, the new projection matrices P_1^n and P_2^n should be expressed in terms of their factorization:

P_1^n = K_1 [R^* \mid -R^* c_1] = [M^{n1} \mid p_4^{n1}], \qquad (20)

P_2^n = K_1 [R^* \mid -R^* c_2] = [M^{n2} \mid p_4^{n2}]. \qquad (21)

From eqs. (13), (20), and (21), the transformation that maps the image plane of P_1 = [M^1 \mid p_4^1] onto the image plane of P_1^n = [M^{n1} \mid p_4^{n1}] has to be computed. At this point the problem can be seen in different ways: (i) follow the original formulation (Fusiello et al., 2000), or (ii) compute linear transformations T_1 and T_2 such that P_1 \xrightarrow{T_1} P_1^n and P_2 \xrightarrow{T_2} P_2^n hold (our approach). This transformation corresponds to the matrix R^*, because the optical centers of the cameras, C_1 and C_2, are not translated. Then, R^* is the basis that spans the points of the rectified images, with the vectors X^n, Y^n, and Z^n as its basis vectors; they are linearly independent and can be written as the matrix
R^* = \begin{bmatrix} \hat{x}^n \\ \hat{y}^n \\ \hat{z}^n \end{bmatrix}. \qquad (22)

Finally, the image transformations T_1 and T_2 are expressed as follows:

T_1 = M^{n1} (M^1)^{-1}, \qquad (23)

T_2 = M^{n2} (M^2)^{-1}. \qquad (24)

These transformations are applied to the original images in order to rectify them (see Figure 11). It is not common for the value of a rectified pixel to correspond, after the transformation, to an integer position in the original image; therefore, the gray levels of the rectified image are computed by bilinear interpolation.

7.3 Experimental Results

The previous image rectification formulation has been coded in Matlab. Figure 11 shows some results. Figures 11(a) and 11(b) show a pair of images taken by the Bumblebee (VS2 and VS3). It can be seen that these images are almost aligned, because the Bumblebee cameras are calibrated and their principal rays are parallel; this can be appreciated by comparing the intrinsic and extrinsic camera parameters of VS2 and VS3 (see Tables 5 and 6). For example, Table 6 describes a horizontal baseline and aligned cameras, while the roll, yaw, and pitch components indicate small rotations (VS2 -> VS3 or VS2 <- VS3). In Figures 11(c) and 11(d), thin black bands surrounding the images are visible. These bands are due to the rotation and, mainly, to the displacement needed to align the camera planes to their final position. Figures 11(e) and 11(f) correspond to the multimodal acquisition system (VS1 and IR1). As mentioned, their baseline (B_IR) is wide and the rotation is marked; hence, the bands in Figures 11(g) and 11(h) are bigger than in the previous case. Notice that the size of the IR image is a quarter of the color one, and for visualization purposes the bands were removed. Thus, a point in the rectified IR image can be found in the corresponding rectified VS1 image by searching only along the horizontal line that contains the initial point.

Figure 11: Rectification results. (a) Left Bumblebee image. (b) Right Bumblebee image. (c) Rectified left Bumblebee image. (d) Rectified right Bumblebee image. (e) IR image. (f) VS1 image. (g) Rectified IR image. (h) Rectified VS1 image.

8. Stereo algorithm

In this section, the main stages of a stereo algorithm are discussed. The block matching approach has been followed for image correspondence; it is explained in detail below. The implemented stereo algorithm was split into matching cost computation, aggregation, disparity computation, and disparity refinement, following the taxonomy presented by Scharstein et al. (2001). In order to evaluate the performance of each studied and implemented function, images from the Middlebury stereo dataset have been used first, and then images from our own dataset have been tested.

In general, block matching methods seek to estimate the disparity at a point in one image by comparing a small region about that point (the template) with a series of small regions extracted from the other image (the search region). This process is inefficient due to redundant computations; for example, it has been reported that a naive implementation of a block matching algorithm for an image of N pixels, a template size of n pixels, and a disparity search range of D pixels has a complexity of O(NDn) operations. If some optimizations are included, the complexity is reduced to O(ND) operations, making it independent of the template size. These properties make block matching algorithms amenable to hardware implementation, and therefore attractive for real-time applications such as ADAS.
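The window-size independence comes from the fact that, for a fixed disparity, the per-pixel absolute differences can be summed over each window with a running (box) filter instead of recomputing every block from scratch. The following is a minimal sketch of that idea, assuming NumPy and a 2-D box filter built from cumulative sums; names such as box_filter and sad_block_matching are illustrative and not taken from the implementation described in this work.

import numpy as np

def box_filter(img, radius):
    # Sum of a (2*radius+1)^2 window around every pixel, via an integral image.
    # The cost per pixel is constant, so the window size does not affect the
    # asymptotic complexity (the O(ND) optimization mentioned above).
    h, w = img.shape
    ii = np.zeros((h + 1, w + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    r = radius
    ys = np.clip(np.arange(h) - r, 0, h)        # window bounds, clamped at borders
    ye = np.clip(np.arange(h) + r + 1, 0, h)
    xs = np.clip(np.arange(w) - r, 0, w)
    xe = np.clip(np.arange(w) + r + 1, 0, w)
    return ii[ye][:, xe] - ii[ys][:, xe] - ii[ye][:, xs] + ii[ys][:, xs]

def sad_block_matching(left, right, d_max, radius=3):
    # Winner-takes-all disparity from aggregated absolute differences.
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    h, w = left.shape
    cost = np.full((h, w, d_max + 1), np.inf)
    for d in range(d_max + 1):
        ad = np.abs(left[:, d:] - right[:, :w - d])   # per-pixel |I1 - I2| at disparity d
        cost[:, d:, d] = box_filter(ad, radius)       # aggregate over the window
    return np.argmin(cost, axis=2)                    # disparity with the lowest SAD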
8.1 Matching cost computation

For each pixel in the reference image and each of its putative matches in the second image, the degree of similarity is evaluated through a dissimilarity function. These values are called matching costs, and they establish a relation $f: I_{reference}(i,j) \rightarrow I_2(x,y)\,|_{(x,y)\,\in\, \text{search space of }(i,j)}$ that weights the cost of a pixel $(i,j)$ in the reference image corresponding to a pixel in the second image. Once the cost of matching pixels is known, the correspondence problem can be formulated as an optimization problem (see Fig. 12).

The matching cost is computed by sliding an $n \times n$ window (where $n = 2k+1$ and $k \in \mathbb{Z}$, $k \ge 1$) over the whole reference image, left to right and top to bottom. For each position, a second window of size $n \times n$ is slid over the second image, along the horizontal direction for rectified images, or along the epipolar line when the images are not rectified. A dissimilarity function, as shown in Table 7, is used to detect potential matches. In general, any relation can be used as a dissimilarity function as long as it has a minimum; formally, if $f(x^*) \le f(x)$ for all $x$ with $|x^* - x| < \varepsilon$, then the point $x^*$ is the best match, because the corresponding blocks are the most similar.

The most common pixel-based matching costs for real-time applications are the squared intensity differences (SSD) and the absolute intensity differences (SAD). Other traditional matching cost functions include normalized cross-correlation (NCC), Pearson's correlation, and znSSD. Statistical correlations are standard methods for matching two windows around a pixel of interest. Notice two facts: the normalization of a window, as performed by znSSD, compensates for differences in gain and bias, improving the matching score, and the NCC function is statistically the optimal method for compensating Gaussian noise.

Searching for possible matches over the whole epipolar line, or row, is inefficient and time consuming. Instead, only a small collinear set of pixels is considered, which bounds the search space to an interval $[d_{min}, d_{max}]$. These limits depend on the scene and on the baseline. For example, a high value of $d_{max}$ could be inappropriate for indoor applications, where objects are expected to be close, whereas it could be correct in outdoor applications.

Table 7: Block-matching functions

Absolute Differences (AD):
$C_{AD}(x,y,d) = \sum_{(i,j) \in N(x,y)} \left( I_1(i,j) - I_2(i+d,j) \right)$

Sum of Absolute Differences (SAD):
$C_{SAD}(x,y,d) = \sum_{(i,j) \in N(x,y)} \left| I_1(i,j) - I_2(i+d,j) \right|$

Sum of Squared Differences (SSD):
$C_{SSD}(x,y,d) = \sum_{(i,j) \in N(x,y)} \left( I_1(i,j) - I_2(i+d,j) \right)^2$

Zero normalized SSD (znSSD):
$C_{znSSD}(x,y,d) = \frac{1}{(2n+1)^2} \sum_{(i,j) \in N(x,y)} \left( \frac{I_1(i,j) - \bar{I}_1(N(x,y))}{\sigma_1} - \frac{I_2(i+d,j) - \bar{I}_2(N(x,y))}{\sigma_2} \right)^2$

Normalized Cross-Correlation (NCC):
$C_{NCC}(x,y,d) = \frac{\sum_{(i,j) \in N(x,y)} \left( I_1(i,j) - \bar{I}_1(N(x,y)) \right)\left( I_2(i+d,j) - \bar{I}_2(N(x,y)) \right)}{\sqrt{\sum_{(i,j) \in N(x,y)} \left( I_1(i,j) - \bar{I}_1(N(x,y)) \right)^2 \sum_{(i,j) \in N(x,y)} \left( I_2(i+d,j) - \bar{I}_2(N(x,y)) \right)^2}}$

Pearson's correlation (PC):
$C_{PC}(x,y,d) = \frac{\sum I_1 I_2 - \frac{1}{mn}\left( \sum I_1 \right)\left( \sum I_2 \right)}{\sqrt{\left( \sum I_1^2 - \frac{1}{mn}\left( \sum I_1 \right)^2 \right)\left( \sum I_2^2 - \frac{1}{mn}\left( \sum I_2 \right)^2 \right)}}$,
where all sums run over $(i,j) \in N(x,y)$, $I_1 = I_1(i,j)$, $I_2 = I_2(i+d,j)$, and $m \times n$ is the window size.

The set of cost values over all pixels and possible disparities forms the initial disparity space $C_0(x,y,d)$; an example is shown in Figure 12(d).
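To make the window comparison concrete, the following sketch evaluates three of the metrics of Table 7 (SAD, SSD, and NCC) for a single pair of n × n blocks; it is only a didactic illustration with names of our own, not the code used in this work.

    import numpy as np

    def block_costs(ref_block, cand_block, eps=1e-12):
        """Dissimilarity of two n x n blocks (cf. Table 7).

        ref_block  : window N(x, y) in the reference image.
        cand_block : window at the candidate disparity in the second image.
        Returns SAD, SSD, and NCC; SAD/SSD are minimized, NCC is maximized."""
        diff = ref_block.astype(float) - cand_block.astype(float)
        sad = np.abs(diff).sum()
        ssd = (diff ** 2).sum()

        # Zero-mean blocks for the normalized cross-correlation.
        a = ref_block - ref_block.mean()
        b = cand_block - cand_block.mean()
        ncc = (a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps)
        return sad, ssd, ncc

For a minimum-based winner-takes-all search (Section 8.3), correlation-type scores such as NCC or Pearson's correlation are typically turned into costs, e.g. 1 − NCC, so that all the metrics of Table 7 can be handled uniformly.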
8.2 Aggregation

In the previous stage, the cost of matching every pixel in the reference image with each pixel of its search space in the second image was computed; therefore, each pixel has several associated cost values. The aggregation step reduces this multidimensional space to only two dimensions (a matrix), simplifying the storage of the matching costs. A review of the published literature on aggregation shows that it can be done as a 2D or 3D convolution, $C(x,y,d) = w(x,y,d) * C_0(x,y,d)$; in the case of rectangular windows: using box filters (Scharstein et al., 2001); shiftable windows implemented with a separable sliding min-filter (Kimura et al., 1999); or truncating the results of the dissimilarity function (Yoon and Kweon, 2006). Different approaches were studied, especially those reported as the best: shiftable windows, summing the cost values, or taking their maximum or minimum. These approaches have been shown to improve the results on a color stereo pair, but when tested in our framework they did not yield significant improvements; for this reason they were finally not adopted. The results shown below were computed directly from $C_0(x,y,d)$ (without aggregation), although this requires more time and disk space, because all the costs associated with each pixel of the reference image are kept.

Figure 12: Cost computation. (a) Left image ($I_{reference}(i=333, j=320)$). (b) Right image. (c) Matching cost of $I_{reference}(i=333, j=320)$ against $I_2(i=333, j=x)$. (d) Cost surface of $I_{reference}(i=333, j=x_1)$ against $I_2(i=333, j=x_2)$.

8.3 Disparity computation

Recall that in the previous subsection the matching costs of the pixels in the reference image were computed, using a dissimilarity function to evaluate the similarity of a point and its surroundings. This operation is time consuming for two reasons: first, the testbed was designed for academic purposes, so high performance and low latency were not the main aim; second, it is necessary to debug and, especially, to understand the effect of each dissimilarity function on the system. For these reasons, every outcome of the cost computation stage was stored.

The disparity computation is cast as an optimization problem in which the cost $C_0(x,y,d)$ is the variable to optimize; the disparity of each pixel is selected by the WTA (Winner-Takes-All) method, without any global reasoning. The correct match of a pixel is determined by the position, or coordinate, where the dissimilarity function reaches its minimum value (see Figure 12(c)).

The cost function does not always have a clear global minimum such as the one depicted in Figure 12(c), which corresponds to an edge point that is highly discriminative. The main problem in stereo is the matching of points with low salience. A textureless region produces matching costs with similar values for all pixels (see Fig. 13(a)), so the cost function looks like a flat line; on the contrary, a textured region has several minima (see Fig. 13(b)). In the current work, three rules are considered to face these problems:

• Matching cost threshold: if the minimum cost value is below a given threshold $T_{min}$, the match is accepted. For instance, Figure 13(c) shows a valid minimum (or correspondence).

• Matching cost average: in this rule, the area under the cost curve is computed; if this value is below a threshold, the minimum is rejected and the point is marked as unmatched (see Fig. 13(a)).
• Matching cost interpolation: a variation of the sampling-insensitive measure of Birchfield and Tomasi (1998) for sub-pixel interpolation was implemented; it is presented below, and a short code sketch is given at the end of Section 8.4.

When a point in the world is imaged by two cameras, the intensity values of the corresponding pixels are in general different. Several factors contribute to this: the reflected light is not the same, the two cameras have different parameters, and the pixel intensities are quantized and noisy. It is therefore necessary to model this behaviour. Birchfield and Tomasi (1998) propose a measure of pixel dissimilarity that compares two pixels using the linearly interpolated intensity functions surrounding them. In the current work a similar approach has been explored: instead of operating on the intensity values, three matching cost values are extracted, the minimum and its two neighbours (left and right), see Figure 13(d). These points $C_{m-1}$, $C_m$, and $C_{m+1}$ are used to fit a quadratic function. The quadratic regression returns the coefficients $a$, $b$, and $c$ of the function $f_r(x) = ax^2 + bx + c$ that fits the initial points with minimum error. Next, $f_r(x)$ is minimized over the domain $[C_{m-1}, C_{m+1}]$; the minimum of $f_r(x)$ indicates the position, coordinate, or index of the conjugate point (match) with sub-pixel precision. Regression models other than ordinary least squares could be used; models based on splines, non-linear regression, and others generally fit the data better, but their computational cost becomes an important problem, since this operation must be performed for every pixel of the reference image: for an image of n rows by m columns, the procedure is executed n × m times.

8.4 Disparity refinement

A complete stereo vision algorithm should include a procedure that refines the disparity map. Early work in computer vision elegantly expressed this problem in the language of Markov Random Fields (MRFs), resulting in energy minimization problems; more recently, algorithms such as graph cuts and loopy belief propagation (LBP) have proven to be very powerful (Szeliski et al., 2008). For the current work, only a left-to-right consistency check and a uniqueness validation are used to eliminate bad correspondences.

Figure 13: Special cost matching conditions. (a) Deviation of the cost curve. (b) Deviation of the cost curve. (c) Valid minimum. (d) Interpolation.

The Left-Right Consistency (LRC) check is performed to get rid of half-occluded pixels: it is better to reject uncertain matches than to accept bad disparities. The LRC procedure is simple and effective for reducing the rate of bad matches. It consists of executing the previous steps again with the reference image exchanged, so that two different disparity maps are computed ($d_{right \to left}$ and $d_{left \to right}$). Next, both maps are checked: if $|d_{right \to left}(x,y) - d_{left \to right}(x,y)| > d_{max\,variation}$, then the pair is rejected (experimentally, $d_{max\,variation} = 3$ pixels). The points $I_R(x,y)$ and $I_L(x,y)$ with large variation between the right and left maps are marked as unmatched. Notice that at this point the correspondences are known, therefore this operation is valid.
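The sub-pixel refinement around the WTA minimum (Section 8.3) and the left-right consistency check described above can be sketched as follows. This is only an illustrative sketch with our own function names: the closed-form three-point parabola fit used here is one simple way of obtaining the minimum of $f_r(x)$, not necessarily the regression procedure implemented in this work, and the disparity sign convention in the LRC check is an assumption.

    import numpy as np

    def subpixel_disparity(cost_curve, d):
        """Refine an integer WTA disparity d using the parabola through
        (d-1, C[d-1]), (d, C[d]), (d+1, C[d+1]); cf. Section 8.3."""
        c_m1, c_0, c_p1 = cost_curve[d - 1], cost_curve[d], cost_curve[d + 1]
        denom = c_m1 - 2.0 * c_0 + c_p1
        if denom <= 0:                               # flat or degenerate curve: keep d
            return float(d)
        return d + 0.5 * (c_m1 - c_p1) / denom       # vertex of the fitted parabola

    def lrc_check(d_lr, d_rl, max_variation=3):
        """Left-right consistency check (Section 8.4): invalidate pixels whose
        left-to-right and right-to-left disparities disagree by more than a threshold."""
        H, W = d_lr.shape
        ys, xs = np.mgrid[0:H, 0:W]
        # Column of the matched pixel in the right image
        # (sign convention assumed: x_right = x_left - d).
        xr = np.clip(xs - np.round(d_lr).astype(int), 0, W - 1)
        return np.abs(d_lr - d_rl[ys, xr]) <= max_variation

With the experimentally chosen threshold of 3 pixels, the pixels failing the check are marked as unmatched, as described above.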
8.5 Experimental Results

This section describes the quality metrics used for evaluating the performance of stereo correspondence algorithms. Two measures proposed by Scharstein et al. (2001) have been used to evaluate the accuracy of the computed disparity maps. In order to evaluate the performance of the current framework, the most relevant parameters are varied, so that the proposed solution is measured quantitatively. The masks shown in Figure 14, which segment the ground truth (see the disparity map, or ground truth, in Figure 8), are used to measure the error over each region using equations (25) and (26). The regions considered are: occluded, non-occluded, textured, textureless, near discontinuities, and all, as Table 8 shows.

Figure 14: Segmented regions. (a) Occluded and discontinuity regions and edges. (b) Occluded and textured regions. (c) Occluded regions.

The following two quality measures are computed:

1. RMS (root-mean-squared) error (measured in disparity units) between the computed disparity map $d_C(x,y)$ and the ground truth map $d_T(x,y)$,

$R = \left( \dfrac{1}{N} \sum_{(x,y)} \left| d_C(x,y) - d_T(x,y) \right|^2 \right)^{\frac{1}{2}},$   (25)

where N is the total number of pixels.

2. Percentage of bad matching pixels,

$B = \dfrac{1}{N} \sum_{(x,y)} \left( \left| d_C(x,y) - d_T(x,y) \right| > \delta_d \right),$   (26)

where $\delta_d$ is a disparity error tolerance; for these experiments $\delta_d = 1.0$.

Notice that this methodology for measuring the error, or quality, of the proposed approach is the common way of evaluating and ranking this kind of algorithm. The regions cited in Table 8 are computed through simple binary operations between the reference image (Figures 14(a), 14(b), and 14(c)) and its ground truth (Fig. 8(c)). The regions are defined as: textureless regions T, where the squared horizontal intensity gradient averaged over a square window of a given size is below a given threshold; occluded regions O, which are occluded in the matching image; and depth discontinuity regions D, edges dilated by a window of width 9 × 9.

Table 8: Quality metrics
Name                      Symbol          Description
rms error all             $R$             RMS disparity error
rms error nonocc          $R_{\bar{O}}$   RMS in non-occluded regions
rms error occ             $R_{O}$         RMS at occlusions
rms error textured        $R_{T}$         RMS in textured regions
rms error textureless     $R_{\bar{T}}$   RMS in textureless regions
rms error discont         $R_{D}$         RMS near discontinuities
bad pixel all             $B$             Bad pixel percentage
bad pixel nonocc          $B_{\bar{O}}$   Bad pixels in non-occluded regions
bad pixel occ             $B_{O}$         Bad pixels at occlusions
bad pixel textured        $B_{T}$         Bad pixels in textured regions
bad pixel textureless     $B_{\bar{T}}$   Bad pixels in textureless regions
bad pixel discont         $B_{D}$         Bad pixels near discontinuities

Table 9 shows the results of the implemented dissimilarity functions: AD, SAD, SSD, znSSD, NCC, NMSDSSD, and Pearson's correlation. The window size was varied and the performance was then measured over each region of interest. These results were obtained from the images shown in Figures 8(a) and 8(b). The best result obtained for each dissimilarity function is shown in Figure 15.

Figure 15: Dissimilarity function comparison.
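As an illustration of equations (25) and (26), the two quality measures can be computed over any of the regions of Table 8 as in the following sketch; the region mask argument and the function name are our own (the evaluation in this work was done in Matlab), and the bad-pixel measure is returned as a percentage to match the convention of Table 9.

    import numpy as np

    def disparity_quality(d_computed, d_truth, mask=None, delta_d=1.0):
        """RMS error (eq. 25) and percentage of bad pixels (eq. 26) over a region.

        mask : boolean array selecting the region of interest (e.g. non-occluded,
               textured, textureless, near discontinuities); None means 'all'."""
        if mask is None:
            mask = np.ones_like(d_truth, dtype=bool)
        err = np.abs(d_computed[mask] - d_truth[mask])
        N = err.size
        rms = np.sqrt((err ** 2).sum() / N)          # eq. (25)
        bad = 100.0 * (err > delta_d).sum() / N      # eq. (26), in percent
        return rms, bad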
Table 9: Quality results. For each dissimilarity function and window size, the first six values are the RMS errors over the regions all, nonocc, occ, textured, textureless, and discont; the last six values are the bad-pixel percentages over the same regions (cf. Table 8).

AD, 3x3:          23.86 22.73 30.64 22.40 24.95 21.27 | 97.21 97.06 98.23 96.90 98.16 96.11
AD, 9x9:          24.31 23.06 31.75 23.01 23.37 22.32 | 97.29 97.18 98.07 97.27 96.53 96.96
AD, 15x15:        24.87 23.49 32.95 23.58 22.82 23.07 | 97.42 97.28 98.39 97.46 96.00 97.87
AD, 21x21:        25.47 23.97 34.15 24.16 22.54 23.93 | 97.77 97.63 98.73 97.81 96.28 98.26
SAD, 3x3:         12.01  8.42 25.63  8.26  9.46  8.29 | 41.69 33.67 97.82 30.29 57.88 39.21
SAD, 9x9:         10.94  6.36 25.96  6.05  8.30  6.22 | 30.74 21.26 97.13 19.08 36.84 37.78
SAD, 15x15:       11.01  6.32 26.27  5.95  8.50  6.56 | 31.57 22.24 96.91 20.04 37.95 43.64
SAD, 21x21:       11.17  6.52 26.47  6.19  8.48  7.13 | 33.10 23.96 97.03 21.65 40.50 47.14
SSD, 3x3:         11.99  8.36 25.70  8.22  9.35  8.36 | 40.49 32.28 97.91 29.00 55.77 38.56
SSD, 9x9:         11.03  6.49 26.06  6.22  8.17  6.86 | 31.11 21.66 97.27 19.72 35.52 42.80
SSD, 15x15:       11.11  6.42 26.43  6.12  8.26  7.29 | 32.92 23.73 97.20 21.78 37.72 49.23
SSD, 21x21:       11.26  6.55 26.73  6.26  8.34  7.76 | 35.44 26.58 97.37 24.42 42.09 52.23
znSSD, 3x3:       14.15 11.01 27.44 10.79 12.49 11.02 | 43.68 35.95 97.77 32.99 57.21 42.82
znSSD, 9x9:       10.94  5.54 27.25  5.43  6.26  8.01 | 26.38 16.24 97.31 15.59 20.86 43.32
znSSD, 15x15:     11.20  5.70 27.84  5.52  6.85  8.23 | 29.97 20.34 97.30 19.21 28.48 49.49
znSSD, 21x21:     11.55  6.18 28.26  5.91  7.86  8.57 | 33.64 24.52 97.45 22.76 37.12 51.85
NCC, 3x3:         16.52 14.29 27.47 13.67 18.14 13.54 | 54.80 48.61 98.06 45.25 72.73 53.48
NCC, 9x9:         11.00  6.21 26.42  5.90  8.07  8.49 | 26.54 16.47 97.07 15.70 21.97 42.80
NCC, 15x15:       10.84  5.81 26.52  5.53  7.54  8.15 | 29.59 20.03 96.50 18.88 28.25 48.82
NCC, 21x21:       11.09  6.26 26.63  5.96  8.12  8.47 | 33.26 24.16 96.94 22.35 37.16 51.26
NMSDSSD, 3x3:     15.57 13.06 27.29 12.42 16.99 12.42 | 49.07 42.11 97.75 38.76 66.14 47.95
NMSDSSD, 9x9:     10.97  5.91 26.81  5.61  7.71  8.03 | 26.15 16.01 97.06 15.32 21.01 41.75
NMSDSSD, 15x15:   10.91  5.66 26.97  5.35  7.55  7.82 | 29.40 19.78 96.68 18.63 28.02 48.06
NMSDSSD, 21x21:   11.20  6.11 27.24  5.77  8.12  8.06 | 33.12 23.99 97.00 22.16 37.09 50.73
Pearson, 3x3:     16.52 14.28 27.47 13.66 18.12 13.54 | 54.80 48.62 98.06 45.26 72.73 53.50
Pearson, 9x9:     11.00  6.21 26.42  5.90  8.07  8.49 | 26.54 16.47 97.07 15.70 21.97 42.80
Pearson, 15x15:   10.84  5.81 26.52  5.53  7.54  8.15 | 29.59 20.03 96.50 18.88 28.25 48.82
Pearson, 21x21:   11.09  6.26 26.63  5.96  8.12  8.47 | 33.26 24.16 96.94 22.35 37.16 51.26

9. Discussion and concluding remarks

As a general conclusion, it can be stated that night vision represents a novel research field where classical image processing techniques cannot be directly applied, since the nature of the signal is different. The initially formulated question, "Could a couple of sensors measuring different bands of the electromagnetic spectrum, as the visible and infrared, form a part of an Advanced Driver Assistance System, and help to reduce the amount of injuries and deaths in the roads?", can be affirmatively answered. Although the problem is really complex, this research shows that it is feasible to develop a system that helps the driver based on visible and infrared cameras. In other words, in the ADAS context it is possible to register in a unique representation information from different sources: the light reflected by an object together with its corresponding thermal emission.

Several topics have been identified as future work; these topics will surely motivate the research during the next years. The most important ones are summarized below:

1. During this research the raw fusion methodology has been fully explored; according to the obtained results, a better performance could be reached if a higher-level fusion were used. This fusion should include high-level data representations from both modalities, for example: segmented images, optical flow registration, and mutual information.

2. Study the possibility of exploiting the camera hardware in order to have a proper video synchronization (the IR video sequence has been recorded at 9 fps, while the VS one at 20 fps).
This asynchrony represents an additional source of error, since moving objects in the scene are not captured at the same instant.

3. The use of 3D data computed from the Bumblebee camera will be considered in order to validate the obtained results. Furthermore, an additional validation process that measures the performance of each individual stage is needed.

4. A particular problem of the proposed multimodal system is the difference between camera resolutions. The IR camera has a lower resolution than the VS one; hence, current triangulation algorithms should be extended to deal with this challenge.

In this work, an experimental study has been performed that compares different dissimilarity functions, algorithms, and approaches in order to build a multimodal system. Common problem areas for stereo algorithms in different modalities, especially in the ADAS context, have also been identified. An important contribution is the isolation achieved between the different stages that compose a multimodal stereo system. Our framework summarizes the architecture of a generic stereo algorithm at different levels (computational, functional, and structural), which is valuable because in the short term it can be extended to a complete dense multimodal testbed, some stages of which already support two modalities.

References

S.A. Ahmed, T.M. Hussain, and T.N. Saadawi. Active and passive infrared sensors for vehicular traffic control. In IEEE 44th Vehicular Technology Conference, volume 2, pages 1393–1397, June 1994.

M. Bertozzi, E. Binelli, A. Broggi, and M.D. Rose. Stereo vision-based approaches for pedestrian detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 15–16, June 2005.

S. Birchfield and C. Tomasi. A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–406, April 1998.

J. Bouguet. Matlab camera calibration toolbox, 2000.

FLIR. BMW incorporates thermal imaging cameras in its cars lowering the risk of nocturnal driving. Technical report, FLIR Commercial Vision Systems B.V., 2008.

A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1):16–22, 2000.

N. Gheissari and A. Bab-Hadiashar. A comparative study of model selection criteria for computer vision applications. Image and Vision Computing, 26(12):1636–1649, 2008.

J. Han and B. Bhanu. Fusion of color and infrared video for moving human detection. Pattern Recognition, 40(6):1771–1784, 2007.

R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.

M. Hild and G. Umeda. Image registration in stereo-based multi-modal imaging systems. In The 4th International Symposium on Image and Signal Processing and Analysis, pages 70–75, September 2005.

R. Istenic, D. Heric, S. Ribaric, and D. Zazula. Thermal and visual image registration in Hough parameter space. In Systems, Signals and Image Processing, pages 106–109, June 2007.

M.D. Keall, W.J. Frith, and T.L. Patterson. The contribution of alcohol to night time crash risk and other risks of night driving. Accident Analysis & Prevention, 37(5):816–824, 2005.

Y.S. Kim, J.H. Lee, and J.B. Ra. Multi-sensor image registration based on intensity and edge orientation information. Pattern Recognition, 41(11):3356–3365, 2008.

S. Kimura, K. Nakano, T. Sinbo, H. Yamaguchi, and E. Kawamura.
A convolver-based real-time stereo machine (SAZAN). In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, page 1457, 1999.

K. Konolige. Small vision systems: Hardware and implementation. In Eighth International Symposium on Robotics Research, pages 111–116, 1997.

S.J. Krotosky and M.M. Trivedi. On color-, infrared-, and multimodal-stereo approaches to pedestrian detection. IEEE Transactions on Intelligent Transportation Systems, 8(4):619–629, December 2007a.

S.J. Krotosky and M.M. Trivedi. Person surveillance using visual and infrared imagery. IEEE Transactions on Circuits and Systems for Video Technology, 18(8):1096–1105, August 2008.

S.J. Krotosky and M.M. Trivedi. Registration of multimodal stereo images using disparity voting from correspondence windows. In IEEE International Conference on Video and Signal Based Surveillance, pages 91–97, November 2006.

S.J. Krotosky and M.M. Trivedi. Mutual information based registration of multimodal stereo videos for person tracking. Computer Vision and Image Understanding, 106(2-3):270–287, 2007b. Special issue on Advances in Vision Algorithms and Systems beyond the Visible Spectrum.

S.J. Krotosky, S.Y. Cheng, and M.M. Trivedi. Face detection and head tracking using stereo and thermal infrared cameras for “smart” airbags: A comparative analysis. In The 7th International IEEE Conference on Intelligent Transportation Systems, pages 17–22, October 2004.

S.K. Kyoung, J.H. Lee, and J.B. Ra. Robust multi-sensor image registration by enhancing statistical correlation. In 8th International Conference on Information Fusion, volume 1, 7 pp., July 2005.

H. Li and Y. Zhou. Automatic EO/IR sensor image registration. In International Conference on Image Processing, volume 3, pages 240–243, October 1995.

J. Michels, A. Saxena, and A.Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 593–600, New York, NY, USA, 2005. ACM.

H.K. Moravec. Visual mapping by a robot rover. In Proceedings of the 6th International Joint Conference on Artificial Intelligence, pages 598–600, Tokyo, 1979.

F. Morin, A. Torabi, and G.A. Bilodeau. Automatic registration of color and infrared videos using trajectories obtained from a multiple object tracking algorithm. In Canadian Conference on Computer and Robot Vision, pages 311–318, May 2008.

S. Prakash, P.Y. Lee, and T. Caelli. 3D mapping of surface temperature using thermal stereo. In 9th International Conference on Control, Automation, Robotics and Vision, pages 1–4, December 2006.

A. Rogalski. Infrared detectors: An overview. Infrared Physics & Technology, 43(3-5):187–210, 2002.

A.D. Sappa, F. Dornaika, D. Ponsa, D. Geronimo, and A. Lopez. An efficient approach to onboard stereo vision system pose estimation. IEEE Transactions on Intelligent Transportation Systems, 9(3):476–490, September 2008.

A. Saxena, M. Sun, and A.Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.

D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 195–202, June 2003.

D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms.
In IEEE Workshop on Stereo and Multi-Baseline Vision, pages 131–140, 2001.

K. Schreiner. Night vision: Infrared takes to the road. IEEE Computer Graphics and Applications, 19(5):6–10, 1999.

D. Scribner, P. Warren, and J. Schuler. Extending color vision methods to bands beyond the visible. In IEEE Workshop on Computer Vision Beyond the Visible Spectrum: Methods and Applications, pages 33–40, 1999.

S. Segvic. A multimodal image registration technique for structured polygonal scenes. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pages 500–505, September 2005.

R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M.F. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):1068–1080, 2008.

M.M. Trivedi, S.Y. Cheng, E.M.C. Childers, and S.J. Krotosky. Occupant posture analysis with stereo and thermal infrared video: Algorithms and experimental evaluation. IEEE Transactions on Vehicular Technology, 53(6):1698–1712, November 2004.

World Health Organization (WHO). World report on road traffic injury prevention. Technical report, Department of Violence & Injury Prevention & Disability (VIP), Geneva, 2004.

World Health Organization (WHO). Global status report on road safety. Technical report, Department of Violence & Injury Prevention & Disability (VIP), Geneva, 2009.

K.J. Yoon and I.S. Kweon. Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):650–656, April 2006.

Z.X. Zhang and P.Y. Cui. Information fusion based on optical flow field and feature extraction for solving registration problems. In International Conference on Machine Learning and Cybernetics, pages 4002–4007, August 2006.

L. Zheng and R. Laganiere. Registration of IR and EO video sequences based on frame difference. In Canadian Conference on Computer and Robot Vision, pages 459–464, 2007.