Evol. Intel. (2010) 3:155–169 DOI 10.1007/s12065-010-0043-y RESEARCH PAPER GPU implementation of a road sign detector based on particle swarm optimization Luca Mussi • Stefano Cagnoni • Elena Cardarelli Fabio Daolio • Paolo Medici • Pier Paolo Porta • Received: 8 February 2010 / Revised: 9 July 2010 / Accepted: 23 September 2010 / Published online: 15 October 2010 Springer-Verlag 2010 Abstract Road Sign Detection is a major goal of the Advanced Driving Assistance Systems. Most published work on this problem share the same approach by which signs are first detected and then classified in video sequences, even if different techniques are used. While detection is usually performed using classical computer vision techniques based on color and/or shape matching, most often classification is performed by neural networks. In this work we present a novel modular and scalable approach to road sign detection based on Particle Swarm Optimization, which takes into account both shape and color to detect signs. In our approach, in particular, the optimization of a single fitness function allows both to detect a sign belonging to a certain category and, at the same time, to estimate its position with respect to the camera reference frame. To speed up processing, the algorithm implementation exploits the parallel computing capabilities offered by modern graphics L. Mussi S. Cagnoni (&) E. Cardarelli P. Medici P. P. Porta Dipartimento di Ingegneria dell’Informazione, University of Parma, Viale G. Usberti 181a, 43124 Parma, Italy e-mail: cagnoni@ce.unipr.it L. Mussi e-mail: mussi@ce.unipr.it E. Cardarelli e-mail: cardar@ce.unipr.it P. Medici e-mail: medici@ce.unipr.it P. P. Porta e-mail: porta@ce.unipr.it F. Daolio Information Systems Institute (ISI) - HEC, University of Lausanne, Internef 135, CH-1015 Lausanne, Switzerland e-mail: fabio.daolio@unil.ch cards and, in particular, by the Compute Unified Device Architecture by nVIDIA. The effectiveness of the approach has been assessed on both synthetic and real video sequences, which have been successfully processed at, or close to, full frame rate. Keywords Particle swarm optimization Road sign detection GPU computing Parallel computing 1 Introduction Automatic traffic sign detection and classification is a very important issue for the Advanced Driver Assistance Systems (ADAS) and road safety. It can both improve safety and help navigation, by providing critical information the driver could otherwise miss, limiting or compensating for drivers’ distractions. Because of this, several road sign detectors have been developed in the last 10 years [24]. In most industrial systems only speed limit signs are detected, since these are considered to be the most relevant for safety. Nevertheless, information provided by warning signs, mandatory signs and all the remaining prohibitory signs can also be extremely significant: ignoring the presence of such signs can lead to dangerous situations or even accidents. Automatic road sign detection systems can be used to both warn drivers in these situations and supply additional environmental information to other on-board systems such as the Automatic Cruise Control (ACC), the Lane Departure Warning (LDW), etc. Both gray-scale and color cameras can be used to this purpose: in the first case, search is mainly based on shape and can be quite demanding in terms of computation time [10, 19]. Using a color camera, the search can be based mainly on chromatic information: color segmentation is, in 123 156 general, faster than shape detection, even if it requires additional filtering; however, images acquired by inexpensive color cameras can suffer from artifacts deriving from Bayer conversion or from other problems related, for instance, to color balance [2]. The approaches to traffic sign detection which rely on color images are usually based on color bases different from RGB; the HSV/HSI color space is the most frequently used [7, 35] but other color spaces, such as CIECAM97 [9], can be used as well. On the one hand, these spaces separate chromatic information from lighting information, making detection of a specified color mostly independent of light conditions. On the other hand, the RGB [32] and YUV [31] color spaces require no transformations, or just very simple ones; however, they require more sophisticated segmentation algorithms, since the boundary between colors is fuzzier. In order to make detection more robust, both color segmentation and shape recognition can be used in cooperation [9]. As regards sign recognition, most methods are based on computational intelligence techniques [6], the most frequent being neural networks [8, 11] and fuzzy logic [13]. In this paper we present a novel approach to road sign detection based on geometric transformations of basic sign templates. Our approach projects sets of three-dimensional points, which sample significant regions of the road sign template to be detected and describe its shape, onto the image plane according to a transformation which maps 3D points in the camera reference frame onto the image; the transformed set of points is then matched to the corresponding image pixels. The likelihood of detection is estimated using a similarity measure between color histograms [34]. This procedure can actually estimate the pose of any object based on a 3D model within any projection system and with any general object model. One of the advantages over other model-based approaches is that this approach does not need any preliminary pre-processing of the image (like, for example, color segmentation) or any reprojection of the full three-dimensional model [34]. Another peculiar feature with respect to similar work [7], besides the aforementioned similarity measure, is that our method relies upon Particle Swarm Optimization (PSO) to estimate, by a single transformation, the pose of the sign in the 3D space at the same time as the position of the sign in the image. Despite being more efficient than many other metaheuristics, PSO is still rather demanding in terms of computational resources; therefore, a sequential implementation of the algorithm would be too slow for real-time applications. As for all other metaheuristics, this is especially true when the function to be optimized is itself computationally complex. The PSO algorithm we have used in this work has been implemented within the nVIDIA 123 Evol. Intel. (2010) 3:155–169 CUDA environment [23, 25], to take advantage of the computing power offered by the massively parallel architectures available nowadays even on cheap consumer video cards. As will be shown, thanks also to the parallel nature of PSO, this choice allowed the final system to manage several swarms at the same time, each specialized in detecting a specific class of road signs. This paper is organized as follows: Sect. 2 briefly introduces PSO and its parallel implementation within CUDA; Sect. 3 addresses the problem of road sign detection, motivating our approach and offering further details on how shape and color information is processed to compute fitness. Finally, in Sect. 4, we report results obtained on both a synthetic video sequence containing two signs and on two real video sequences, acquired on-board a car, for a total running time of about 30 min. 2 GPU implementation of particle swarm optimization Particle Swarm Optimization is a simple but powerful optimization algorithm, introduced by Kennedy and Eberhart [15]. In the last decade many variants of the basic PSO algorithm have been developed [18, 26, 29] and successfully applied to many problems in several fields [28], image analysis being one of the most frequent ones. In fact, image analysis tasks can be often reformulated as the optimization of an objective function, directly derived from the physical features of the problem being solved. Beyond this, PSO can often be more than a way to ‘tune’ the parameters of another algorithm, but can be directly the main building block for an original solution. For example, [3, 5, 21, 27, 38] use PSO to directly infer the position of an object that is sought in the image. 2.1 PSO basics PSO searches for the optimum of a fitness function, following rules inspired by the behavior of flocks of birds in search of food. A population of particles move within the fitness function domain (usually termed their search space), sampling the function in the points corresponding to their position. This means that, after each particle’s move, the fitness computed at its new position is evaluated. In their motion, particles preserve part of their velocity (inertia), while undergoing two attraction forces: the first one, called cognitive attraction, attracts a particle towards the best position it visited so far, while the second one, called social attraction, pulls the particle towards the best position ever found by the whole swarm. Based on this model, in basic PSO, the following velocity and position update equations are computed for each particle: Evol. Intel. (2010) 3:155–169 157 Vi ðtÞ ¼ w Vi ðt 1Þ þ C1 R1 ½Xib ðt 1Þ Xi ðt 1Þ þ C2 R2 ½Xigb ðt 1Þ Xi ðt 1Þ Xi ðtÞ ¼ Xi ðt 1Þ þ Vi ðtÞ ð1Þ ð2Þ where the subscript i refers to the i-th dimension of the search space, V is the velocity of the particle, C1, C2 are two positive constants, w is the inertia weight, X(t) is the particle position at time t, Xb(t - 1) is the best fitness position visited by the particle up to time t - 1, Xgb(t - 1) is the best fitness point ever visited by the whole swarm; R1 and R2 are two random numbers from a uniform distribution in [0, 1]. Many variants of the basic algorithm have been developed [29], some of which have focused on the algorithm behavior when different topologies are defined for the neighborhoods of the particles [16]. A usual variant of PSO substitutes Xgb ðt 1Þ with X lb ðt 1Þ, which represents the ‘local’ best position ever found by all particles within a pre-set neighborhood of the particle under consideration. This formulation admits, in turn, several variants, depending on the topology of such neighborhoods. Among others, Kennedy and coworkers evaluated different kinds of topologies, finding that good performance could be achieved using random and Von Neumann neighborhoods [16]. Nevertheless, the authors also indicated that selecting the most efficient neighborhood structure is, in general, a problem dependent task. Since the random topology is usually designed such that each particle communicates with two random neighbors, most often using a simple ring topology is adequate for the problem at hand, while allowing for an easier implementation. Whatever the choices of the algorithm structure, parameters, etc., and despite good convergence properties, PSO is still an iterative process which, depending on problem difficulty, may require several thousands (when not millions) of particle updates and fitness evaluations. Therefore, designing efficient PSO implementations is a problem of great practical relevance. This becomes even more critical, if one considers real-time applications to dynamic environments in which, for example, the fast convergence properties of PSO may be used to track moving points of interest (maxima or minima of a specific dynamicallychanging fitness function) in real time. This is the case, for example, of computer vision applications in which PSO has been used to track moving objects [22] or to determine location and orientation of objects or people [12, 21]. 2.2 Implementing PSO within CUDA We implemented a standard PSO with particles organized with the classical ring topology [23]. The rationale behind this choice is, on the one hand, the inadequacy of PSO with synchronous best update and global-best topology, which would have been the most natural and easiest parallel implementation, for optimizing multi-modal problems [4]. On the other hand, as reported above, PSO with ring topology provides a very good compromise between quality of results, efficiency, and easiness of implementation. The parallel programming model of CUDA allows programmers to partition the main problem in many subproblems that can be solved independently in parallel. Each sub-problem may then be further decomposed into many modules that can be executed cooperatively in parallel. In CUDA, each sub-problem becomes a thread block, which is composed by a certain number of threads which cooperate to solve the sub-problem in parallel. The software modules that describe the operation of each thread are called kernels: when a program running on the CPU invokes a kernel, a unique set of indices is assigned to each thread, to denote to which block it belongs and its position inside it. These indices allow each thread to ‘personalize’ its access to the data structures and, in the end, to achieve problem parallelization and decomposition. To exploit the impressive computation capabilities of graphic cards effectively within CUDA and implement a parallel version of PSO, the best approach is probably to consider the main phases of the algorithm as separate tasks, parallelizing each of them separately: this way, each phase can be implemented by a different kernel and the whole optimization process can be performed by iterating the basic kernels needed to perform one generational update of the swarm. Since the only way CUDA offers to share data among different kernels is to keep them in global memory (i.e., the RAM region, featuring the slowest access time by far, which is shared by the processes run by the GPU and the ones run by the CPU) [25], the current status of our PSO must be saved there. Data organization is therefore the first problem to tackle to exploit the GPU read/write coalescing capability and maximize the degree of parallelism of the implementation. With our data design, it is enough to appropriately arrange the thread indices to run several swarms at the same time very efficiently. In order to have all tasks performed on the GPU, and avoid, as much as possible, the bottleneck of data exchange with the CPU using global memory, we generate pseudorandom numbers running the Mersenne Twister [20] algorithm directly on the GPU using the kernel available within the CUDA SDK: this way the CPU load is virtually zero. In the following we briefly describe the three kernels into which our PSO implementation has been subdivided. 123 158 Evol. Intel. (2010) 3:155–169 2.2.1 Position update 3.1 The image projection model and camera calibration A computation grid, divided into a number of blocks of threads, updates the position of all particles being simulated. Each block updates the data of one particle, while each thread in a thread block updates one element of the position and velocity arrays. In the beginning the particle’s current position, personal best position, velocity and local best information are loaded, after which the classical PSO equations are applied. The main goal of computer vision is to make a computer analyze and ‘understand’ the content of an image, i.e., the projection of a region of the real world which lies within the field of view of a camera onto the camera’s sensor plane (the image plane), in order for it to be able to take some decision based on such an analysis. The simplest mathematical model which describes the spatial relationships between the 3D real-world scene and its projection on the image pixels is a general affine transform of the following form: 2.2.2 Fitness evaluation p i ¼ A Pi This kernel is scheduled as a computation grid composed by one block for each particle being simulated (irrespective of the swarm to which it belongs). Each block comprises a number of threads equal to the total number of points that describe a sign (three sets of 16 points each) so that the projection of all points on the current image is performed in parallel. Successively, each thread contributes to building the histograms described in Sect. 3.3: the thread index determines to which set/histogram the projected point under consideration belongs, while the sampled color value determines which bin of the histogram is to be incremented. Finally the fitness value is computed according to Eq. 9 where, once again, histogram similarity is assessed in parallel. 2.2.3 Bests update For each swarm, a thread block is scheduled with a number of threads equal to the number of particles in the swarm. As already mentioned, in our system we have used a ring topology with radius equal to 1. Firstly, each thread loads in shared memory both the current and the best fitness values of its corresponding particle, to update the personal best, if needed. Successively, the current local best fitness value is found by computing the best fitness of each particle’s neighborhood (including the particle and the neighboring one on both sides of the ring), comparing it to the best value found so far and updating it, when necessary. 3 Road sign detection In this section, we first introduce the basics of projective geometry which underlie the theory of image acquisition by means of a camera. Then, we describe the road sign detection algorithm, based on computer vision and PSO, focusing, in particular, onto the fitness function, whose optimization drives the whole detection process. 123 ð3Þ where Pi represents the 3D coordinates of a point in the world, while pi represents its 2D projection on the image expressed with homogeneous coordinates. The matrix A [ M3x3 models a central linear projection and is usually expressed as 2 3 f x 0 u0 A ¼ 4 0 fy v0 5: ð4Þ 0 0 1 Here, briefly, fx and fy are, respectively, estimates of the focal length along the x and y direction of the image and (u0,v0) is an estimate of the principal point (a.k.a. the ‘center of projection’) of the image. The process whose aim is to determine the above parameters as well as the position of the camera in the world (not needed in our case) is usually referred to as camera calibration. The so-called extrinsic parameters describe the camera position and orientation while the intrinsic ones are those appearing in Eq. 4, i.e. focal lengths and center of projection. In the literature many algorithms for camera calibration have been proposed for monocular [30, 36, 37] and stereo systems [17, 39]. Many of them are based on particular hypotheses that ease the calibration step; usually these hypotheses are not verified in automotive environments, such as short-distance perception, still scenarios, or still camera. In our case, we can consider the last two constraints to be satisfied, since we are not taking into account the correlation between subsequent frames of a video sequence to infer a model of the car motion and forecast trajectories, but analyze each frame as an independent image. We also set the origin of our ‘world’ reference frame which, in our case, is coincident with the camera frame, to be a fixed point in the car (see Fig. 1). Other issues to be tackled in general outdoor and, especially, in automotive applications are related to specific conditions of outdoor environments. In fact, temperature and illumination conditions can vary and can be barely controlled. Regarding illumination, in particular, extreme situations like direct sunlight or strong reflections Evol. Intel. (2010) 3:155–169 Fig. 1 The projection model used in our system: OwX, OwY, OwZ are the three axes of the world reference system (which in our case is coincident with the camera reference system), f is the focal distance, (u0,v0) is the projection center, and pi : (ui, vi) is the projection of a generic point Pi : (Xi, Yi, Zi) onto the image plane must be taken into account. Other light sources, such as car headlights or reflectors, interfering with the external environmental light, might also be present in a typical automotive scene. 3.2 PSO-based road sign detection algorithm Suppose that an object of known shape and color, having any possible orientation, may appear within the field of view of a calibrated camera. In order to detect its presence and, at the same time, to precisely estimate its position, one can use the following algorithm (see also [34]): 1. 2. 3. Consider a set of key contour points, of known coordinates with respect to a reference position, and representative of the shape and colors of the object. Translate (and rotate) them to a hypothesized position visible by the camera and project them onto the image. Verify that color histograms of the sets of key points match those of their projection on the image to assess the presence of the object being sought. Road signs are relatively simple objects belonging to few possible classes characterized by just a few regions of homogeneous colors. Each sign class can be described by a model consisting of a few sets of key points which lie just near the color discontinuities, with points belonging to the same set being characterized by the same color. Once all points in a set are projected onto the image plane, one must verify that the colors of the corresponding pixels in the image match the ones in the model. A further set of points, lying just outside the object silhouette, can help verify whether the object border has been detected: this is, in general, confirmed when colors of corresponding pixels in 159 such a region are significantly different from those of the object. In Fig. 2 we show three classes of traffic signs (priority, warning, and prohibitory signs), along with the sets of points of the model we use to represent them. For each model, we consider three sets of 16 points: one lies just outside the external border (therefore, on the image background), one on the red band just inside the external border, and one on the central white area, as close to the red border as possible. Please notice that, for the prohibitory signs, we use points uniformly distributed along their circular border while, for the triangular priority and warning signs, points are more densely distributed in proximity of the corners. This choice reduces the chance of mismatching circular signs to triangular ones since, at a similar scale, the corners of triangular signs lie well outside the borders of the circular ones. If a calibrated camera is available on a moving car, given an estimate of the position and rotation of a road sign inside the camera’s 3D field of view, the sets of points in the world reference frame can be roto-translated to this position and then projected onto the image plane, to verify the likelihood of the estimate by matching color histograms. All is needed for detection is a method to generate estimates of signs positions and refine them until the actual position of a sign is found. When a pose estimate is available for a sign, all points belonging to its model can then be projected onto the image plane using the following equation: pi ¼ A ðRe Pi þ te Þ ð5Þ where te represents the offset/position of the sign in the x, y and z directions with respect to the camera mounted on the car (in our case, to the world reference system, as well). Re is a 3 9 3 rotation matrix derived from the estimate of the sign rotation: since a free rotation in the 3D space can always be expressed with three degrees of freedom, it is sufficient to estimate three values (e.g., the rotation angles around the three axes) in order to represent all possible rotations of a sign. To this aim we apply PSO, as introduced in Sect. 2. In our method, each swarm generates location estimates for a specific class of signs; each particle in the swarm encodes an estimate of the sign position by four values, which represent its offsets along the x, y and z axes, as well as its rotation around the vertical axis (yaw) in the camera reference frame. Although our system is already structured for estimating all six degrees of freedom of a pose estimate, we deliberately chose to ignore the rotation around the camera optic axis (roll) and the horizontal axis (pitch) after some preliminary tests. Although it makes sense to have the system able to estimate every possible rotation of a sign, we had no experimental evidence about this need, at least 123 160 500 400 Evol. Intel. (2010) 3:155–169 500 set 1 set 2 set 3 400 500 set 1 set 2 set 3 400 300 300 300 200 200 200 100 100 100 0 0 0 -100 -100 -100 -200 -200 -200 -300 -300 -300 -400 -400 -400 -500 -500 -500 -400 -300 -200 -100 0 -500 -400 -300 -200 -100 0 100 200 300 400 500 100 200 300 400 500 set 1 set 2 set 3 -500 -500 -400 -300 -200 -100 0 100 200 300 400 500 Fig. 2 The three different sets of points used to represent a priority sign (left), a warning sign (center), and a prohibitory sign (right). The dimensions of these models conform to the Italian standards (largest versions). All coordinates are expressed in millimeters for the general road configurations we dealt with in our tests. In fact, introducing all three angles would not affect the complexity of the fitness function, since the full transformation is already computed anyway but, of course, it would significantly increase the size of the PSO search space. A particle swarm can then be used to detect the presence of signs of a specific class within the image, by assigning a fitness value to each position estimate encoded by its particles. Such a value is proportional to the similarity between the projections onto the image plane of the points belonging to the sign model, obtained according to Eq. 5, and the corresponding image pixels. If the fitness value in the point (particle position) under evaluation is above a given threshold, we consider that point to be the location of a sign of the class associated to the swarm. This is the main feature that characterizes this algorithm. In fact, having an accurate estimation of the position and orientation of a sign offers the possibility to rectify its image by means of a simple Inverse Perspective Mapping (IPM) transform [1], in order to obtain a pre-defined view. This means it is always possible to obtain a standardized view (in terms of size and relative orientation) of the detected sign. This is the optimal input for a classifier whose purpose is to recognize the content of a detected sign irrespective of its actual orientation and distance. Having the signs pre-classified into different classes in the detection phase is a further significant advantage which makes recognition easier and more accurate, since a separate classifier can then be used for each class. At the moment, we only focus on detection: no classification of the signs which have been detected is performed, even if we plan to add a sign recognition module to our system in the immediate future. PSO is run at each new frame acquisition for a predefined number of generations. Actually, the algorithm structure and its GPU implementation permit to schedule 123 more than one PSO runs per frame. On the one side, this offers a second opportunity to detect a sign which was missed in the previous run on that frame. On the other side, it also allows each swarm to detect more signs belonging to the same class in the same frame. In fact, when a sign is detected in the first run, it is ‘hidden’ in order to prevent the corresponding swarms from focusing on it again during the subsequent runs. In the next subsection we describe the fitness function we use in our PSO based approach in details. 3.3 Fitness function Let us denote the three sets of points used to describe each sign class (see, for example, the models in Fig. 2) as S1 ¼ fs1i g, S2 ¼ fs2i g and S3 ¼ fs3i g, with sxi 2 R2 (they all lie on the xy plane), with i [ [1,16]. Based on the position encoded by one particle and on the projection matrix derived from the camera calibration, each set of points is roto-translated and projected onto the current frame, obtaining the corresponding three sets of points which lie on the image plane P1 ¼ fp1i g, P2 ¼ fp2i g and P3 ¼ fp3i g. To verify whether the estimated position is actually correct, three color histograms [33] in the HSV colorspace, one for each channel, are computed for each set Px with x [ {1, 2, 3}. Let us denote each of them as Hcx , formally defined as: n 1X dðIc ðpxi Þ bÞ ð6Þ Hxc ðbÞ ¼ n i¼1 where c [ {H, S, V} specifies the color channel, x [ {1, 2, 3} identifies the set of points, b [ [1, Nbin], (Nbin being the number of bins in the histogram), n represents the number of points in the set (sixteen in our case), the function d(n) returns 1 when n = 0 and zero otherwise and, finally, Ic ðpÞ : R2 ! R maps the intensity of channel c at pixel Evol. Intel. (2010) 3:155–169 161 location p to a certain bin index. The term 1n is used PNbin c to normalize the histogram such that b¼1 Hx ðbÞ ¼ 1. Moreover, three additional histograms, denoted as Hcref , are used as reference histograms for the red band surrounding all three sign models taken into consideration. The Bhattacharyya coefficient q [14], which offers an estimate of the amount of overlap between two statistical samples, is then used to compare the histograms. qðH1 ; H2 Þ ¼ Nbin pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X H1 ðbÞH2 ðbÞ: ð7Þ b¼1 The Bhattacharyya coefficient returns a real value between 0, when there is no overlap at all between the two histograms, and 1, when the two histograms are identical. Finally, if we use Sx;y ¼ H S S V V qðHH x ; H y Þ þ qðH x ; H y Þ þ qðH x ; Hy Þ 3 ð8Þ to express the similarity of the two triplets of histograms computed for the sets of points x and y, we can express the fitness function as f ¼ k0 ð1 S1;2 Þ þ k1 ð1 S2;3 Þ þ k2 S1;ref k0 þ k1 þ k2 ð9Þ where k0 ; k1 ; k2 2 Rþ are used to weigh the contributions of the three distances appearing in the above equation. Such a fitness function requires that: • • • histograms computed on the first two sets of points be as different as possible, hypothesizing that, in case the sign had been detected, the background color nearby the sign would differ significantly from the red band. the histogram of the points in the red band be as different as possible from the one computed on the inner area of the sign. histograms H c1 resemble as much as possible the reference histograms H cref computed for the red band surrounding the sign. Histograms of regions having colors that differ only slightly from the model, possibly because of noise, produce high values of S1,ref. The fitness function f will therefore be close to 1 only when the position of one particle is a good estimate of the sign pose in the scene captured by the camera. Actually, despite this being the most natural and general way to express this sort of fitness function, we noticed that system performances improved if we ignored the V (Value, or Intensity) channel at all. The reason for this stands probably in the fact that the lighting conditions of the region surrounding a sign are usually rather uniform, which makes intensity information useless for the discriminant properties of the fitness function, even if it affects its value. At the same time, we verified that in evaluating the reference (red) color it is preferable to neglect also the S (Saturation) channel. This means that, for the red band, we only use the H (Hue) channel, which is the only channel which encodes pure color information. 4 Experimental results The PSO parameters were set to w = 0.723, C1 = C2 = 1.193. Three swarms of 64 particles were run for up to 200 generations per frame to detect regulatory signs (of circular shape), warning signs (of triangular shape), and priority signs (of reversed triangular shape), respectively. The coefficients appearing in Eq. 9 were empirically set as follows: k0 = 1.4, k1 = 1.0, k2 = 0.8. The fitness threshold above which we considered a detection to have occurred was set to 0.9. The search space, in world coordinates, ranged from -4 m to 6.5 m along the horizontal direction (x axis), from 6 m to -1.6 m vertically (y axis), and from 9.5 m to 27 m in the direction of the car motion and of the camera optic axis (z axis). We finally allowed a range p4 ; p4 for sign rotation with respect to the vertical axis. All previous settings were set empirically after some preliminary tests in order for them to represent general values, independent of a particular sequence of images, defining a reasonable invariant region for the PSO to explore. The overall test of the system was divided in two separate phases. During the first one, synthetic video sequences were used to assess the ability of the system to correctly find and estimate the sign poses. During the second, more significant phase, real-world images were processed to assess the system’s detection performances in typical urban and suburban environments. 4.1 Tests on synthetic images In the first test phase, we simulated a 3D rural environment with a road and a pair of traffic signs using the public domain raytracer POV-Ray1 We relied on the Roadsigns macros by Chris Bartlett2 to simulate the signs and on some ready-to use objects by A. Lohmüller and F. A. Lohmüller3 to simulate the road. Bumps and dirtiness were added to the traffic signs in order to simulate more realistic conditions. Fig. 3 shows a sample frame from one of the synthetic sequences. As time passes, the simulated car moves forward zigzagging from left to right. At the same time, as they get closer to the car, the two signs rotate around their 1 2 3 http://www.povray.org. http://lib.povray.org/collection/roadsigns/chrisb2.0/roadsigns.html. http://f-lohmueller.de/pov_tut/objects/obj_500i.htm. 123 162 Evol. Intel. (2010) 3:155–169 Fig. 3 Sample frame taken from our synthetic video sequence simulating a country road scenario with two differently shaped roadsigns vertical axis. We introduced rotations to test the ability of our system to estimate the actual roto-translation between the camera and the sign which is detected. In fact, in our case, each particle moves in R4 and its position represents the x, y and z offsets of the sign as well as its rotation with respect to the vertical axis (yaw). Figure 4 shows three frames from the very beginning, from the middle, and from the end of the sequence, respectively. Image contrast has been reduced in this figure to better highlight the swarm positions. White points superimposed to the images represent the best-fitness estimate of the sign position, while black points depict the hypotheses represented by all other individuals. In Fig. 4.a it is possible to see the two swarms during the initial search phase: in this case both are on the wrong target despite being already in the proximity of a sign. Figure 4b, c show how the two swarms correctly converged onto their targets. For a more detailed performance analysis, Fig. 5 shows results obtained in estimating the actual position of the signs throughout the sequence. Figure 5 (top left) shows the actual x position and the estimated one (mean and standard deviation over one hundred runs), versus the frame number, for both the warning (light line) and the regulatory (dark line) signs. As can be seen, the horizontal position for the two signs is correctly detected until the end of the sequence with a precision of the order of centimeters. The sinusoidal trend of the two position reflects the zigzagging behaviour of the car we simulated. The top right part of the figure shows the results for the y coordinates. This time the actual position is constant since the simulated car is supposed to have a constant pitch. Again, the estimated position is correct with errors of just few centimeters. Similar considerations can be made for the bottom left graph of Fig. 5, which reports 123 Fig. 4 Output of a run of our road sign detection system at the very beginning (a), at middle length (b) and near to the end (c) of a synthetic video sequence results of depth (z coordinate) estimation. Even if signs are rather far from the car (about fifteen meters in the beginning) estimates are very good with errors of less than half a meter. The error is mostly due to the distance between the two most external sets of points, which introduces a tolerance in the estimation of the actual position of the target 3000 2000 163 (b) 1050 Regulatory: estimates pos Warning: estimated pos Regulatory: actual pos Warning: actual pos roadsigns’ y position (mm) (a) roadsigns’ x position (mm) Evol. Intel. (2010) 3:155–169 1000 0 -1000 -2000 -3000 -4000 Regulatory: estimated pos Warning: estimated pos Regulatory: actual pos Warning: actual pos 1000 950 900 850 800 750 700 650 600 0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 roadsigns’ z position (mm) Regulatory: estimated pos Warning: estimated pos Regulatory: actual pos Warning: actual pos 18000 100 120 140 160 180 200 frame number 16000 14000 12000 10000 (d) 0.8 roadsigns’ Yaw angle (rad) frame number (c) 20000 80 0.6 8000 Regulatory: estimates pos Warning: estimated pos Regulatory: actual pos Warning: actual pos 0.4 0.2 0 -0.2 -0.4 -0.6 0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 frame number 80 100 120 140 160 180 200 frame number Fig. 5 Position estimation errors: a shows the estimation of the horizontal position (x coordinate) for both signs, b shows the vertical position estimates, and c the depth estimates. The estimations of sign rotations around the vertical axis are shown in d border. Tightening this distance could improve the precision of the results but, at the same time, would make it more difficult to obtain high fitness values for signs which are far from the car because, in that situation, the depth value is large and the projections of these two sets of points on the image almost overlap, producing color histograms which are very similar. Finally, in the bottom right part of the figure we show results of yaw estimation. In this case results are not as precise: it is possible to see how the rotation of the warning sign is estimated rather precisely, even if, in the first third of the simulation, the standard deviation is very high, while the rotation of the regulatory sign seems to be randomly estimated. Since we are dealing with small angles this must be due, again, to the tolerance in locating the signs introduced by the distance between the two external sets of points, which is more likely to affect fitness when matching circular signs than with triangular ones. 4.2 Real-world tests The system was then validated on two real-world image sequences acquired with cameras placed on-board a car. We compared results obtained by our system with those obtained by a road sign recognition system previously developed by some of the authors [2], affiliated to VisLab4, the computer vision laboratory of our Department at the University of Parma. For this reason we will refer to such a system as the VisLab system from now on. The benchmark used for this comparison comprised two of the sequences that were acquired during its development. The first sequence, which includes 10000 frames having a resolution of 750 9 480 pixels, acquired at 7.5 fps by a PointGrey Firefly color camera, and is therefore about 22 min long, was recorded while driving on the ‘Parma orbital’ on a sunny day. Since the road circumnavigates the town and the sequence includes a full tour of the orbital, the sequence contains images featuring all possible light orientations. The second sequence, having the same resolution, is little less than 5000 frames long and was acquired by an AVT Guppy color camera at 7.5 fps on the ‘Turin orbital’ on a cloudy day. Images in this sequence feature more constant lighting but lower contrast. A significantly long segment of this sequence has been acquired while driving in an urban environment, while most of the first 4 http://www.vislab.it. 123 164 sequence has been acquired on a separate-lane road. In the two sequences, the car on which the cameras were mounted runs at speeds ranging from 0 (a few crossings and roundabouts are present) to more than 70 km/h. Using the settings reported above, signs were detected at an average distance from the car of about 12 m Fig. 6 displays screenshots of the system, taken while running it: the main window area shows the actual camera view being currently processed, while the right vertical panel keeps track of the last road sign detected for each one of the three categories (regulatory, warning, and priority, see caption). In Fig. 7 it is possible to see some examples of the signs which could be detected, as they appear after rectification. Signs (a)–(c) were detected with normal light conditions. Signs (d) and (e) had direct sunlight, while sign (f) had a strong backlight. Images (g) and (h) were the only two false positives we ever observed during all our tests: image (g) shows a red car partially occluded by another one, which Fig. 6 Sample screen snapshots taken while processing realworld images. On the right the images of the detected signs, correctly rectified to a standard frontal view, are shown: the top frame shows prohibitory signs, the middle frame warning signs, and the bottom frame priority signs. Images of the last detected signs remain visible on the window until a new sign of the same category is detected. The rectangle highlights the sign which was last detected 123 Evol. Intel. (2010) 3:155–169 creates an inverted triangular shape with a partially red border, and was detected as a priority sign, while (h), a detail of a commercial poster showing a red line which forms a triangle with a red border, was (correctly?) detected as a warning sign. Sign (i) was partially covered by a tree. Sign (j) was depicted inside a bigger yellow panel. The internal region of signs (b) and (i)–(l) is not white, which highlights the characteristic of the system of looking for a red band coupled with two color discontinuities on either sides. Finally, signs (m)–(o) were correctly detected in extremely poor lighting conditions: in the original frames, which are shown in Fig. 8a, they were difficult to see even by humans. Looking again at the rectified sign (a), and comparing it to the original image showed in Fig. 8b it is possible to notice how the rectification process almost reduced to zero the sign rotation with respect to the vertical axis, causing it to appear as if it had been observed from a perpendicular frontal viewpoint. Evol. Intel. (2010) 3:155–169 165 Fig. 7 Some of the signs that were detected, after rectification. Our system is able to detect signs in very different lighting conditions A validation software based on the unique code assigned to each sign permitted to further assess the effectiveness of the system. Based on the annotations made during the development of the VisLab system, the validation software produced detection statistics in terms of false positives, false negatives, and correct detections (true positives). The above statistics were compared with those obtained, on the same sequence, by the VisLab system. Such a 123 166 Fig. 8 Original frames where signs (m)–(o) (above), and a (below) of Fig. 7 were detected system implements a three-steps process (see Fig. 9): color segmentation, shape detection, and classification based on several neural networks. Since illumination conditions deeply affect the first step, a particular gamma correction method has been developed to compensate for color drifts typical of sunset or dawn [2]. Results obtained on the sequence shot in Parma are compared in Table 1, while data in Table 2 compares results obtained on the sequence shot in Turin. Since PSO is a stochastic algorithm, we also assessed our system’s performance repeatability by executing several runs for each sequence. The tables report, for our system, the best and worst results obtained in the test runs. Fig. 9 Block diagram of the VisLab system used as reference (from [2]) 123 Evol. Intel. (2010) 3:155–169 In general, the performances of our system were comparable, and often better, than those of the VisLab system, in terms of correct detections. The system was also very selective and only generated two false positives (reported in Fig. 7) which, curiously enough, were only noticed while annotating the results by hand and were not counted as false positives by the validation software because, by chance, in the same frames there was actually a sign of exactly the same kind. Considering the conditions in which the two test sequences have been acquired and the results yielded by the two systems on such sequences, our system seems to be more robust than VisLab’s with respect to light changes and critical conditions such as backlight or blinding. In fact, while results on the sequence shot in Turin are comparable, our system outperforms the VisLab’s if the sequence shot in Parma is taken into consideration. The only exception, noticeable in both sequences, regards priority signs. This might seem to contrast with the good performances obtained on the warning signs, with which they share shape and reference point sets, differing only for orientation. However, one should notice that the pole on which the signs are usually mounted intersects the warning signs contour in a position (the middle of one side) where few reference points are present, while it does so close to a vertex of the priority sign, where reference points for such a sign are denser. This suggests that the fitness function may be affected negatively by such an ‘interference’ and that a different distribution of reference points for such a sign class may limit this problem. Another possible justification might be offered by the fact that priority signs are often mounted above other signs, such as roundabout signs, which could also affect the fitness function. 4.3 Computation efficiency The CUDA implementation of the whole system described above permitted to achieve very good execution times. Experiments were run on two different machines. The first one is equipped with a 64-bit Intel(R) Core(TM)2 Duo Evol. Intel. (2010) 3:155–169 167 Table 1 Results for the CUDA-PSO and the VisLab (reference) systems on the ‘Parma orbital’ sequence System VisLab (reference) # Category (tot) False Pos False Neg # Priority (29) 4 9 (31%) # Prohibitory (51) 1 33 (64.7%) # Warning (44) 0 # Total (124) 5 CUDA-PSO-RSD Correct Det False Pos min–max False Neg min–max Correct Det min–max 20 (69%) 1–1 9–10 (31–34.5%) 19–20 (65.5–69%) 18 (35.3%) 0–0 20–26 (39.2–51%) 25–31 (49–60.8%) 23 (52.3%) 21 (47.7%) 1–1 10–13 (22.7–29.5%) 31–34 (70.5–77.3%) 65 (52.4%) 59 (47.6%) 2–2 39–49 (31.5–39.5%) 75–85 (60.5–68.5%) Table 2 Results for the CUDA-PSO and the VisLab (reference) systems on the ‘Turin orbital and downtown’ sequence System VisLab (reference) CUDA-PSO-RSD # Category (tot) False Pos False Neg Correct Det False Pos False Neg min–max Correct Det min–max # Priority # Prohibitory (14) (53) 0 4 1 (7.1%) 22 (41.5%) 13 (92.9%) 31 (58.5%) 0–0 0–0 4–8 (28.6–57.1%) 20–22 (37.7–41.5%) 6–10 (42.9–71.4%) 31–33 (58.5–62.3%) # Warning (45) 0 10 (22.2%) 35 (77.8%) 0–0 7–8 (15.6–17.8%) 37–38 (82.2–84.4%) # Total (112) 4 33 (29.5%) 79 (70.5%) 0–0 31–38 (27.7–33.9%) 74–81 (66.1–72.3%) processor running at 1.86 GHz with a moderately priced GeForce 8800GT video card by nVIDIA, equipped with 1 GB of video RAM and fourteen multi-streaming processors (MPs), which adds up to a total of 112 processing cores. The second one is powered by a 64-bit Intel(R) Core(TM) i7 CPU running at 2.67 GHz, combined with a top-level Quadro FX5800 graphics card, also by nVIDIA, having 4 Gb of video RAM and thirty MPs or, in other words, 240 processing cores. In spite of the great disparity between the two setups, no major differences were observed between the execution times. This suggests that our system, in spite of the impressive amount of operations required, still does not saturate the computational power of this kind of GPUs. All our tests were performed off-line. The input video sequences were encoded as a set of single shots to be singularly loaded from disk with no streaming. Therefore, in all timing operations we took care to exclude disk input/ output latency times. In fact, live recordings from a camera connected to the PC would permit to directly transfer images data into the computer RAM in almost negligible time, avoiding slow disk accesses. With this setup, we obtained good performances with many different system settings. For example, employing simultaneously 3 swarms (one per each sign class under consideration) composed by 64 particles, and activated twice per frame for 200 generations, yielded a processing speed of about 20 frames per second (about 50 ms of actual processing time per frame). The frame rate improved to about 24 fps if each swarm was run just once per frame. Using swarms of 32 particles resulted in about 30 fps and 48 fps when two runs or one single run per frame were executed, respectively. In another set of tests, where only 100 generations per run were simulated, processing speed further increased to about 50 fps executing two PSO runs per frame and 65 fps with a single run. Running each swarm more than once and ‘hiding’, in the subsequent runs, any sign that has been previously detected allows a swarm to deal effectively with the presence of pairs of signs of the same class in the same frame, an event which occurs rather frequently. A sequential version of the algorithm would require up to about half a second for the most demanding (and performing) of the settings reported above. This means that the same system would hardly process video at 4–5 fps, a processing speed which is not acceptable for this kind of application. Therefore, our system is able to detect many types of road signs processing images at a speed close to full frame rate. Considering that many existing systems actually work at speeds that are significantly lower than full frame rate, we can also say that some time remains available to perform sign classification. From this point of view, being able to reconstruct a frontal view of given size for each sign which is detected by our system allows for using rather simple classifiers in the recognition stage. Therefore, we expect that embedding such a stage in our system, which is actually our final goal, will not affect processing speed significantly. Even if it is not possible to directly compare the processing performance of our system to the ones of the VisLab system (which is stated to be able to run at about 123 168 13 fps including sign classification on a dual-processor Pentium PC with clock frequency of 2 GHz), the performances of our system appear to be competitive with our reference. Evol. Intel. (2010) 3:155–169 5. 5 Conclusions and future directions 6. We have shown that PSO, provided with a suitable fitness function, can effectively detect traffic signs in real time. Experimental results on both synthetic and real video sequences showed that our system is able to correctly estimate the position of the signs it detects with a precision of about ten centimeters in all directions, with depth being (rather obviously) the less accurate. We have considered signs of three possible classes, which, in normal road settings, account for far more than half of the occurrence of signs. In any case, the three classes taken into consideration are also those which are more likely to be confused among one another. In fact, the sign classes omitted from our analysis have features that are mostly ‘orthogonal’ to the one that characterize the signs presently sought and to one another’s. These considerations, along with the modularity of our system, lead us to expect that extending our system to those other classes could be feasible by introducing only small changes in the system, obtaining comparable results in terms of quality. As concerns processing speed, the considerations about scalability made in Sect. 4.3 also induce optimistic expectations. Finally, the way our system detects road signs permits us to re-project all the images of the detected signs back to a standard frontal view which represents the optimal input for a classification step. Because of this, the introduction of a classification module into the system has the highest priority in our near-future agenda. Acknowledgments We would like to express our thanks and appreciation to Gabriele Novelli, Denis Simonazzi and Marco Tovagliari for their help in tuning, testing and assessing the performances of our system. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. References 1. Mallot HA, Bülthoff HH, Little JJ, Bohrer S (1991) Inverse perspective mapping simplifies optical flow computation and obstacle. Biol Cybern 64(3):177–185 2. Broggi A, Cerri P, Medici P, Porta PP, Ghisio G (2007) Real time road signs recognition. In: Proceedings of IEEE intelligent vehicles symposium 2007, Istanbul, Turkey, pp 981–986 3. Cagnoni S, Mordonini M, Sartori J (2007) Particle swarm optimization for object detection and segmentation. In: Applications of evolutionary computing. Proceeding of EvoWorkshops 2007, Springer, pp 241–250 4. Cagnoni S, Mussi L, Daolio F (2009) Empirical assessment of the effects of update synchronization in Particle Swarm 123 19. 20. 21. 22. Optimization. In: Poster and workshop proceedings of the XI conference of the Italian association for artificial intelligence. Reggio Emilia, Italy (2009). Electronic version, ISBN 978-88903581-1-1 Anton Canalis L, Hernandez Tejera M, Sanchez Nielsen E (2006) Particle swarms as video sequence inhabitants for object tracking in computer vision. In: Proceedings of IEEE international conference on intelligent systems design and applications (ISDA’06), pp 604–609 Engelbrecht AP (2007) Computational Intelligence: an Introduction, 2nd edn. Wiley, England de la Escalera A, Armignol JM, Mata M (2003) Traffic sign recognition and analysis for intelligent vehicles. Image Vis Comput 21(3):247–258 de la Escalera A, Moreno LE, Puente EA, Salichs MA (1994) Neural traffic sign recognition for autonomous vehicles. In: Proceedings of IEEE 20th international conference on industrial electronics, control and instrumentation 2:841–846 Gao X, Shevtsova N, Hong K, Batty S, Podladchikova L, Golovan A, Shaposhnikov D, Gusakova V (2002) Vision models based identification of traffic signs. In: Proceedings of European conference on color in graphics image and vision. Poitiers, France, pp 47–51 Gavrila D (1999) Traffic sign recognition revisited. In: Mustererkennung 1999, 21. DAGM-symposium. Springer, pp 86–93 Hoessler H, Wöhler C, Lindner F, Kreßel U (2007) Classifier training based on synthetically generated samples. In: Proceedings of 5th international conference on computer vision systems. Bielefeld, Germany Ivekovic S, John V, Trucco E (2010) Markerless multi-view articulated pose estimation using adaptive hierarchical particle swarm optimisation. In: Di Chio C et al (eds) Applications of evolutionary computing: proceedings of EvoApplications 2010, Istanbul, Turkey, Part I, LNCS 6024, Springer, pp 241–250 Jiang G-Y, Choi TY (1998) Robust detection of landmarks in color image based on fuzzy set theory. In: Proceedings of IEEE 4th international conference on signal processing 2:968–971 Kailath T (1967) The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans Commun Technol 15(1):52–60 Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings IEEE international conference on neural networks, IV, IEEE, New York, pp 1942–1948 Kennedy J, Mendes R (2002) Population structure and particle swarm performance. In: Proceedings of congress on evolutionary computation—CEC, IEEE, pp 1671–1676 Hyukseong K, Park J, Kak A (2007) A new approach for active stereo camera calibration. In: Proceedings of IEEE international conference on robotics and automation, pp 3180–3185 Liang J, Qin A, Suganthan P, Baskar S (2006) Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans Evol Comput 10(3):281–295 Loy G, Barnes N (2004) Fast shape-based road signs detection for a driver assistance system. In: Proceedings of IEEE/RSJ international conference on intelligent robots and systems, Sendai, Japan, pp 70–75 Makoto M, Takuji N (1998) Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul 8(1):3–30 Mussi L, Cagnoni S (2008) Artificial creatures for object tracking and segmentation. In: Applications of evolutionary computing: proceedings of EvoWorkshops 2008, Springer, pp 255–264 Mussi L, Cagnoni S (2009) Particle swarm for pattern matching in image analysis. In: Serra R et al (eds) Proceedings of WIVACE 2008, Italian Workshop on artificial life and evolutionary computing, World Scientific, pp 89–98 Evol. Intel. (2010) 3:155–169 23. Mussi L, Daolio F, Cagnoni S (2010) Evaluation of particle swarm optimization algorithms within the CUDA architecture. Inf Sci. doi:10.1016/j.ins.2010.08.045 24. Nguwi Y, Kouzani, A (2006) A study on automatic recognition of road signs. In: Proceedings of IEEE conference on cybernetics and intelligent systems. Bangkok, Thailand, pp 1–6 25. nVIDIA Corporation (2009) nVIDIA CUDA Programming Guide v. 2.3. http://www.nvidia.com/object/cuda_develop.html 26. Montes de Oca M, Stützle T, Birattari M, Dorigo M (2009) Frankenstein’s PSO: a composite particle swarm optimization algorithm. IEEE Trans Evol Comput 13(5):1120–1132 27. Owechko Y, Medasani S (2005) A swarm-based volition/attention framework for object recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition—workshops (CVPR’05). IEEE, pp 91–91 28. Poli R (2008) Analysis of the publications on the applications of particle swarm optimisation. J Artif Evol Appl 2008(1):1–10 29. Poli R, Kennedy J, Blackwell T (2007) Particle swarm optimization: an overview. Swarm Intel 1(1):33–57 30. Tsai RY (1987) A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-theshelf TV cameras and lenses. IEEE J Robot Autom 3:323– 344 31. Shadeed WG, Abu-Al Nadi DI, Mismar MJ (2003) Road traffic sign detection in color images. In: Proceedings of IEEE 10th international conference on electronics, circuits and systems 2:890–893 169 32. Soetedjo A, Yamada K (2005) Fast and robust traffic sign detection. In: Proceedings of IEEE international conference on systems, man and cybernetics 2:1341–1346 33. Sonka M, Hlavac V, Boyle R (2007) Image processing, analysis, and machine vision, 3rd edn. CL-Engineering 34. Taiana M, Nascimento J, Gaspar J, Bernardino A (2008) Samplebased 3D tracking of colored objects: a flexible architecture. In: Proceedings of British machine vision conference (BMVC’08). BMVA, pp 1–10 35. Vitabile S, Pollaccia G, Pilato G (2001) Road signs recognition using a dynamic pixel aggregation technique in the HSV color space. In: Proceedings of international conference on image analysis and processing, Palermo, Italy, pp 572–577 36. Wei GQ, Ma SD (1994) Implicit and explicit camera calibration: theory and experiments. IEEE Trans Pattern Anal Machine Intell 16(5):469–480 37. Zhang Z (2000) A flexible new technique for camera calibration. IEEE Trans Pattern Anal Machine Intell 22(11):1330–1334. http://research.microsoft.com/*zhang/Calib/ 38. Zhang X, Hu W, Maybank S, Li X, Zhu M (2008) Sequential particle swarm optimization for visual tracking. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR’08). IEEE, pp 1–8 39. Ziraknejad N, Tafazoli S, Lawrence P (2007) Autonomous stereo camera parameter estimation for outdoor visual servoing. In: Proceedings of IEEE Workshop on machine learning for signal processing, pp 157–162 123