IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 1, FEBRUARY 2012

A Hybrid Approach for Building Extraction From Spaceborne Multi-Angular Optical Imagery

Anish Turlapaty, Balakrishna Gokaraju, Qian Du, Senior Member, IEEE, Nicolas H. Younan, Senior Member, IEEE, and James V. Aanstoos, Senior Member, IEEE

Abstract—The advent of high resolution spaceborne imagery enables the efficient and precise detection of complex urban details. This urban land use study focuses on building extraction and height estimation from spaceborne optical imagery. Such methods support 3D visualization of urban areas, digital urban mapping, and GIS databases for decision makers. In particular, a hybrid approach is proposed for efficient building extraction from optical multi-angular imagery: a template matching algorithm is formulated for automatic estimation of relative building height, and the relative height estimates are used in conjunction with a support vector machine (SVM)-based classifier to separate buildings from non-buildings. The approach is tested on ortho-rectified Level-2a multi-angular images of Rio de Janeiro from the WorldView-2 sensor, and its performance is validated using a 3-fold cross validation strategy. The final results are presented as a building map and an approximate 3D model of buildings. The building detection accuracy of the proposed method is improved to 88%, compared to 83% without using multi-angular information.

Index Terms—Building extraction, height estimation, multi-angular optical data.

I. INTRODUCTION

Monitoring urban sprawl plays an important role in the efficient planning and management of a city against various challenges.
Real-time and precise information is in high demand by federal, state, and local resource management agencies to support effective decisions on challenges such as environmental monitoring and administration, utility transportation, land-use/land-cover change detection, water distribution, and drainage systems of urban areas [1]. Very high resolution spaceborne sensors are a cost- and time-effective choice for region-based analysis to produce detailed urban land-cover maps. The high resolution capability enables the detection of urban objects such as individual roads and buildings, leading to automation in land mapping applications, cartographic digital mapping, and geographic information systems (GIS). Building extraction has been an active research area for the last three decades [2]–[4].

Manuscript received August 06, 2011; revised October 31, 2011; accepted November 23, 2011. Date of publication January 04, 2012; date of current version February 29, 2012. A. Turlapaty is with the Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS 39762 USA. Q. Du is with the Department of Electrical and Computer Engineering, Mississippi State University, and also with the Geosystems Research Institute, Mississippi State University, Mississippi State, MS 39762 USA (corresponding author, e-mail: du@ece.msstate.edu). N. H. Younan is with the Department of Electrical and Computer Engineering, Mississippi State University, and also with the Geosystems Research Institute, Mississippi State University, Mississippi State, MS 39762 USA. B. Gokaraju and J. V. Aanstoos are with the Geosystems Research Institute, Mississippi State University, Mississippi State, MS 39762 USA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSTARS.2011.2179792
Recent approaches for building extraction include: 1) active contour models (ACM)-based detection on high resolution aerial imagery with possible utilization of light detection and ranging (LIDAR) data, 2) energy point process method on digital elevation model (DEM) data, 3) height estimation from synthetic aperture radar (SAR) data, 4) classification-based approaches on optical imagery, and 5) hybrid approaches combining either height estimation or contour methods with classification techniques on optical imagery. Early research on building extraction was focused on aerial imagery due to its high spatial resolution. Region-based ACM driven geometric snakes (GS) are more suitable for aerial imagery, since they rely on intensity, texture, and statistical features of the pixels. Another advantage of GS is their ability to detect topological changes without noticeable edge data [5]. The existing GS approach for building detection was improved by training the model with gray level values from target regions. The traditional snake-based building detection was improved through the incorporation of digital surface model (DSM) and height data [6]. The DSM was generated from existing LIDAR data and a DEM. Later, approximate contours and initial points of interest were identified and the snakes were optimized to fit the building contours. The standard approach for DEM-based building extraction depends on functional minimization with regularization [7]. This method was computationally improved by modification of the energy-based objective function in the original formulation [8]. In another study [9], void and elevated areas were extracted from LIDAR data. Using normalized difference vegetation index (NDVI) data and edge detection of void areas, building shape extraction was improved. 
The limitation of the ACM-based methods described in [6] is that they perform well on isolated buildings, but further research is required to analyze their performance on closely spaced and connected buildings. Additionally, the images in [6], [9] have nearly uniform backgrounds, with no non-building objects exhibiting edges or elevated areas. In contrast, images of commercial urban scenes have complex backgrounds requiring a more sophisticated approach. ACMs appear to perform well when buildings are the dominant category in the image scene [5], [6], [9]. The extraction methods described in [5] are tested on aerial images with only rectangular buildings, not on more complex shapes. Many studies have also proposed the use of SAR for building extraction [10]–[13]. However, our research is focused on building extraction from optical satellite images. Regarding height estimation from spaceborne optical images, various approaches, such as template matching, dynamic programming (DP), and a variant of DP, have been analyzed [14]. In the template matching method, an image is searched for the area best correlated with a given template from another image. The DP approach tracks epipolar lines in two images and is formulated as an optimization problem using a dissimilarity function. In the modified DP approach, a mutual information function is used instead of the dissimilarity function. The height precision of each of these methods was close to 1 m. Several classification-based building extraction methods have also been developed for spaceborne images. These algorithms usually operate on both panchromatic and multispectral image sets; examples include a neural network classification of spectral and structural features [15] and a Bayesian classifier approach [16].
A hybrid approach for building extraction was developed using support vector machine (SVM)-based binary classification [17]. This approach utilizes a combination of the intensity values from an ortho-rectified image, NDVI, and a normalized DSM (nDSM). In [18], it was pointed out that the radiometric resolution of a sensor plays a significant role in shadow detection and removal, and that many simple algorithms (such as bimodal histogram splitting) provided the best chances of discriminating shadow from non-shadow regions in the imagery. This problem can be addressed more easily using sensor data of very high radiometric resolution [19]. The existing building extraction methods for spaceborne optical imagery also have limitations. For instance, the method in [14] was limited to panchromatic images with little variation in the relative size, shape, and separation of buildings. Moreover, the images in that study [14] had hardly any non-building objects above ground level. These methods need further investigation on complex image scenes with multiple bands. In the Hough-transform-based approach [17], the authors assumed that buildings were either rectangles or circles; this assumption may not hold for complex building shapes. Dynamic programming has been used successfully in template matching, especially for one-dimensional data such as speech recognition [20]. It was also useful for the analysis of epipolar stereo image pairs with low building height variations [21]. Since the WorldView-2 dataset used in this study contains images from several different view angles and the building heights have a high degree of variance, dynamic programming may not be applicable without significant modifications. In this paper, we propose a hybrid building extraction and relative building height estimation methodology for an urban scene using the multi-angular characteristics of optical imagery from the WorldView-2 sensor.
The objective is to improve building extraction performance by appropriately utilizing optical multi-angular information. Our major contributions are: 1) an alternative approach for shadow detection using the yellow and red bands of WorldView-2 multispectral imagery; 2) a relative height estimation methodology based on template matching; and 3) a hybrid approach that significantly improves building extraction accuracy by using the template matching results, followed by classification of the image using multi-band spectral features. The rest of the paper is organized as follows. Section II presents the proposed hybrid methodology with mathematical details. Section III presents its application to WorldView-2 imagery, followed by classification results and discussion. Finally, Section IV concludes with a summary and proposed future work.

II. METHODOLOGY

A block diagram of the proposed hybrid method is illustrated in Fig. 1. First, pan-sharpening is conducted to enhance the spatial resolution of the multispectral images. Second, the pan-sharpened data is used to detect and remove shadow and vegetation regions, followed by a template matching technique for height estimation and potential building segment detection. Third, statistical features are extracted from the potential building segments and used in an SVM-based supervised classification for final building extraction.

A. Pan-Sharpening

Consider a panchromatic image of size N1 × N2 pixels with resolution r_p m, and a multispectral (MS) image set from the same satellite with resolution r_m m and K spectral bands. In the pan-sharpening stage, the first step is to resample the original MS imagery to resolution r_p m by cubic spatial interpolation. The next step is to apply the Gram-Schmidt orthogonalization based pan-sharpening method of the ENVI package [22] to the panchromatic image and the re-sampled MS image.
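As a rough illustration of this stage, the sketch below implements a simplified component-substitution variant of Gram-Schmidt-style pan-sharpening in NumPy. ENVI's actual Gram-Schmidt implementation differs in detail; the use of the band mean as the synthetic low-resolution pan and the covariance-based injection gains are illustrative assumptions, and the MS bands are assumed to be already resampled (e.g., by cubic interpolation) to the panchromatic grid.

```python
import numpy as np

def gs_pansharpen(ms, pan):
    """Simplified component-substitution pan-sharpening sketch.

    ms  : (K, H, W) multispectral bands, resampled to the pan grid.
    pan : (H, W) panchromatic band.
    """
    ms = ms.astype(float)
    pan = pan.astype(float)
    pan_lr = ms.mean(axis=0)          # synthetic low-resolution pan
    detail = pan - pan_lr             # spatial detail to inject
    out = np.empty_like(ms)
    for k in range(ms.shape[0]):
        band = ms[k]
        # injection gain from the covariance of the band with the synthetic pan
        g = np.cov(band.ravel(), pan_lr.ravel())[0, 1] / (pan_lr.var() + 1e-12)
        out[k] = band + g * detail
    return out
```

When the panchromatic band carries no detail beyond the synthetic pan, the bands pass through unchanged; otherwise each band receives a share of the high-frequency detail proportional to its correlation with the synthetic pan.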
The pan-sharpened (PS) image set is the input to the proposed hybrid approach.

B. Height Estimation

1) Shadow and Vegetation Removal: Shadows pose a serious impediment to our hybrid approach and can be confused with other class information, so their removal is crucial. Vegetation detection helps avoid the challenges that can arise in template matching with objects such as tall trees. In addition, this detection significantly reduces the amount of non-target information passed to further processing. Finally, after the masking of shadows and vegetation, the resulting image consists only of buildings and other man-made structures such as roads, parking lots, and pavements. The NDVI is calculated from the near infrared (NIR) and red bands of the pan-sharpened image as

NDVI(i, j) = [NIR(i, j) − Red(i, j)] / [NIR(i, j) + Red(i, j)]    (1)

The vegetation mask is obtained by thresholding the NDVI image. The threshold can be determined with the maximum between-class variance (MBCV) criterion [23], [24]. The MBCV method assumes that the histogram of an image has two major modes and selects the threshold that maximizes the inter-class variance between these two modes, thereby maximizing their separability [23], [24].

Fig. 1. Block diagram of the proposed building extraction approach, consisting of three major stages: (1) pan-sharpening, (2) relative height estimation, and (3) building extraction.

Previous studies show that the normalized shadow index (NSI) and the normalized spectral shadow (NSS) produce very low index values for shadows but often confuse them with surrounding water bodies and vegetation [25]. Similarly, shadow regions have been processed by transforming the histogram for grey-scale compensation to discriminate them from non-shadow regions [26].
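The NDVI computation and MBCV thresholding described above can be sketched as follows. The MBCV criterion here is implemented as the classic Otsu between-class-variance search over a histogram; the function names and the histogram bin count are illustrative choices, not from the paper.

```python
import numpy as np

def mbcv_threshold(values, nbins=256):
    """Threshold maximizing the between-class variance (MBCV/Otsu criterion),
    as used in the text for both the NDVI and shadow thresholds."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)              # class-0 probability at each candidate cut
    m = np.cumsum(p * centers)     # cumulative first moment
    mt = m[-1]                     # global mean
    w1 = 1.0 - w0
    valid = (w0 > 1e-12) & (w1 > 1e-12)
    bcv = np.zeros_like(w0)
    # between-class variance: (mt*w0 - m)^2 / (w0*w1)
    bcv[valid] = (mt * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(bcv)]

def vegetation_mask(nir, red):
    """NDVI as in (1), followed by MBCV thresholding; returns a boolean mask."""
    ndvi = (nir - red) / (nir + red + 1e-12)
    return ndvi > mbcv_threshold(ndvi.ravel())
```

The same `mbcv_threshold` routine can be reused on the log-product image of the yellow and red bands to separate the lowest (shadow) mode.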
Shadow regions have also been detected by transformation to Hue-Saturation-Intensity (HSI) space followed by removal of the NDVI-based vegetation [27]. In the current research, the product of the yellow and red bands in (2) is used for shadow detection:

P(i, j) = Y(i, j) · R(i, j)    (2)

A given pixel at (i, j) is categorized as shadow if P(i, j) is below a threshold. The threshold value is determined using the MBCV of the image [23], [24]; the lowest mode of the histogram represents the shadows in the image. Once the vegetation and shadow regions in the image are screened out, the remaining image is used as the input to the subsequent height estimation.

2) Template Matching for Potential Building Extraction: Template matching is commonly used in image processing applications, such as object tracking in forward looking infrared imagery [28], face recognition using normalized gradient matching [29], and human detection based on body shape [30]. The method deals with two stereo images: one image provides templates of target objects, and the goal is to locate these templates in the other image. Usually, the search image has a slightly different view angle from that of the template image. In this paper, we propose to use this technique for relative height estimation of a given image scene based on the template displacements. Consider two k-th band images I1 and I2 at different view angles. Each template in the image I1, a spatial window of w × w pixels centered at pixel (i, j), is tracked in the image I2 to estimate the relative height. The displacement corresponding to a template is

d = || (i′, j′) − (i, j) ||    (3)

where (i′, j′) is the center pixel of the matching area in the second image. As illustrated in Fig. 2, the height h of a building can be related to this displacement with the cosine rule:

d² = (h tan θ1)² + (h tan θ2)² − 2 h² tan θ1 tan θ2 cos φ    (4)

where θ1 and θ2 are the two incidence angles, and φ is the angle between the ground projections of the sensor lines of sight. Here, the angle φ depends only on the view angles of the satellite [31].
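The geometric relation above can be made concrete with a short sketch. It assumes the law-of-cosines reading of (4): the rooftop's ground offsets under the two views have lengths h·tan(θ1) and h·tan(θ2) with angle φ between them, so d = k·h for a scale factor k fixed by the viewing geometry. The function name and the degree-based interface are illustrative.

```python
import math

def height_scale(theta1_deg, theta2_deg, phi_deg):
    """Scale factor k in d = k * h from the two incidence angles and the
    angle phi between the ground projections of the lines of sight."""
    t1 = math.tan(math.radians(theta1_deg))
    t2 = math.tan(math.radians(theta2_deg))
    phi = math.radians(phi_deg)
    val = t1 * t1 + t2 * t2 - 2.0 * t1 * t2 * math.cos(phi)
    return math.sqrt(max(val, 0.0))   # guard against tiny negative round-off

# A measured displacement (pixels times ground sample distance) divided by k
# gives the height; without calibration it serves as a relative height only.
```

With identical view angles and coincident ground projections the parallax vanishes (k = 0), which is why two distinct views are required.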
The angles θ1 and θ2 depend on the building height; however, compared to the altitude of the satellite, the influence of the building height variation is negligible. Hence, from (4), the building height is linearly related to the displacement d, which can therefore be used as a relative height. Initially, for a given template, a corresponding search area, called a frame, is selected from the second image. This frame has to be large enough to contain even the farthest displacement from (i, j), which depends on the height of the building. The frame size depends on the difference between the view angles of the two images, so the template size and frame size are not usually equal. When computing the FFT-based correlation, the frame and template are padded with zeros to the nearest larger power-of-two size.

Fig. 2. An illustration of the relation between template displacement and object height [31].

The 2D Fourier transforms of both the template t and the frame f are computed as [32]

F_t(u, v) = FFT2{t(x, y)}    (5)

F_f(u, v) = FFT2{f(x, y)}    (6)

and their correlation is computed with the 2D inverse transform:

c(x, y) = IFFT2{F_f(u, v) F_t*(u, v)}    (7)

Then, the normalized correlation is determined by

γ(x, y) = [c(x, y) − t̄ s(x, y)] / (N σ_t σ_f(x, y))    (8)

where s(x, y) is the regional sum at each pixel of the frame and σ_f(x, y) is the corresponding regional standard deviation (t̄ and σ_t are the mean and standard deviation of the template). The regional sum is computed over a window with a size equal to that of the template, and N is the number of pixels in the template. The point (i′, j′) used in (3) and (4) is the location with the largest correlation γ. The maximum correlation value has to be greater than a selected threshold for building discrimination; otherwise, the template is considered planar [33]–[35]. The matching template is selected based on the maximum correlation and, in the case of multiple matches, the closest match is retained. By repeating (5)–(8) for all the templates in the first image, the relative height can be estimated for the whole scene. The output of the template matching is a set of segments that primarily correspond to buildings.

C. Building Extraction

At this stage, a block classification approach is chosen based on the dimensions of the buildings in the image. For example, some buildings occupy hundreds of pixels in one dimension and a few occupy tens of pixels in the other dimension. In addition, the disadvantages of pixel-level or very small block-level classification for building extraction are: 1) a higher computational load for the classification algorithm, 2) the possibility of more false positives because of the similarity of blocks or pixels of buildings with those from non-building regions, and 3) a lack of sufficient spatial and spectral variability for efficient classification. Building detection from high resolution satellite images requires a resolution of at least 5 m, which enables the expression of buildings as objects rather than mixed pixels. Hence, for a high resolution image with 0.5 m resolution, a block size of 10 × 10 is equivalent to a 5 m resolution. Finally, an appropriate block size has to be chosen for the buildings in the imagery, because a smaller size would not attain enough spatial variance, whereas a very large size would mix classes such as parking lots and roads [36]–[38]. Feature extraction is crucial for block-based image classification, since the features computed for each block represent the spatial and textural properties of that block. Moreover, feature vectors play a key role in developing the classification model, as they constitute the bridge between the high-level object category (abstraction) and raw pixel intensities. The image is divided into spatial blocks of size B × B, and the total number of blocks is N_b = (N1/B) × (N2/B). For each block, the statistical features, such as mean (μ), variance (σ²), skewness (s), kurtosis (κ), and energy (E), are extracted. 2D-wavelet-based features are also computed as follows. Initially, a 2D wavelet decomposition is applied to each spatial block using the following recursive filtering equations at the desired level of decomposition l, with the initial approximation A_0 equal to the block itself:

A_{l+1}(m, n) = Σ_p Σ_q h(p) h(q) A_l(2m − p, 2n − q)
H_{l+1}(m, n) = Σ_p Σ_q h(p) g(q) A_l(2m − p, 2n − q)
V_{l+1}(m, n) = Σ_p Σ_q g(p) h(q) A_l(2m − p, 2n − q)
D_{l+1}(m, n) = Σ_p Σ_q g(p) g(q) A_l(2m − p, 2n − q)    (9)

where A, H, V, D represent the approximation, horizontal, vertical, and diagonal coefficients, respectively, and g and h denote the highpass and lowpass filters of the selected Haar mother wavelet, respectively. The statistical features in the wavelet domain, such as energy (E_w), entropy (H_w), maximum coefficient (M_w), and mean (μ_w), are then computed. A feature vector for the b-th block in the k-th band is constructed as

f_{b,k} = [μ, σ², s, κ, E, E_w, H_w, M_w, μ_w]    (10)

where b = 1, …, N_b. The final feature matrix collects these vectors over all blocks and all K bands. After the minimum building height is detected, it is used as a threshold for screening out the samples that certainly do not correspond to buildings. As a result of this screening, most of the remaining feature vectors correspond to buildings, and only a few belong to non-buildings (such as roads and parking lots). The SVM is then adopted to classify buildings from the non-building classes, with building detection formulated as a binary classification problem. The primary goal of the SVM is to devise an optimal hyperplane that maximizes the marginal distance from the support vectors of each class [39]–[43]. The sample data is grouped into training and testing data, and a 3-fold cross validation strategy is employed: the feature set is divided into three folds; in each validation iteration, two folds (66.66% of the data) are used for training and one fold (33.34%) for testing, so that after three iterations each of the three folds has been used for both training and testing.
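The FFT-based normalized correlation of (5)–(8) can be sketched as below. This sketch pads to the frame size rather than to a power of two, and its normalization is the classic zero-mean normalized cross-correlation, which may differ in detail from the paper's exact form of (8); the function name is illustrative.

```python
import numpy as np

def ncc_match(frame, template):
    """Locate `template` inside `frame` via FFT-based correlation, normalized
    by local frame statistics; returns the top-left corner of the best match
    and its correlation score."""
    f = frame.astype(float)
    t = template.astype(float)
    n = t.size
    tz = t - t.mean()                       # zero-mean template
    F = np.fft.rfft2(f)
    O = np.conj(np.fft.rfft2(np.ones(t.shape), f.shape))
    # raw correlation, cf. (5)-(7)
    corr = np.fft.irfft2(F * np.conj(np.fft.rfft2(tz, f.shape)), f.shape)
    s = np.fft.irfft2(F * O, f.shape)                      # regional sum
    s2 = np.fft.irfft2(np.fft.rfft2(f * f) * O, f.shape)   # regional sum of squares
    local_var = np.maximum(s2 - s * s / n, 0.0)            # windowed energy about mean
    ncc = corr / np.sqrt(local_var * (tz * tz).sum() + 1e-12)
    peak = np.unravel_index(np.argmax(ncc), ncc.shape)
    return peak, ncc[peak]
```

A score of 1 indicates a perfect match; in the pipeline described above, peaks below the correlation threshold (0.6 in the experiments) would mark the template as planar.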
Using the testing outputs of the SVM classification, a building map is produced for the image scene. The areas of the individual segments of this map are computed, and the smaller segments are discarded based on a minimum building area; this threshold can be determined by applying MBCV to the segment areas. The resulting map then consists of buildings only. The building map is overlaid on the relative height map to obtain the height data for individual buildings. Then, by averaging this height data separately for each building, an approximate 3D model can be generated.

III. EXPERIMENT AND RESULTS

The dataset used in the experiment comprises five multi-angle ortho-rectified Level-2a panchromatic and multispectral images from the WorldView-2 sensor [44]. The multispectral image consists of eight spectral bands, namely coastal, blue, green, yellow, red, NIR1, NIR2, and NIR3, ranging from 400 nm to 1040 nm with a spatial resolution of 2 m. The panchromatic image, covering 450 nm to 800 nm, has a 0.5 m spatial resolution. The image set was acquired over the downtown region of Rio de Janeiro, Brazil, in January 2010 within a three-minute time frame. The scene includes a large number of tall buildings, wide roads, parking lots, and a mixture of community parks and plantations. The five incidence angles of the sensor, i.e., 10.5° (view 3), 21.7° (view 2), 26.7° (view 4), 31.6° (view 1), and 36.0° (view 5), were used in this research. For further details on the azimuth angles, please refer to [44].

A. Pan-Sharpening Result

From the original image at view 3, two subscenes of size 525 × 525 and 500 × 500 pixels, as shown in Fig. 3, were studied. The panchromatic image was fused with the re-sampled MS images, resulting in the pan-sharpened images illustrated in Fig. 4. It is clearly observed that the edges of the objects are much more distinct in the panchromatic image than in the MS data.
In the pan-sharpened output, the spectral signatures of the MS imagery and the spatial features of the panchromatic image are both well retained.

B. Template Displacement Estimation

1) Detection of Shadows and Vegetation From Pan-Sharpened Images: The pixel-level product of the yellow and red bands of the pan-sharpened image was computed. The histogram of the logarithm of this product image presents a bimodal distribution of shadows and non-shadows, as shown in Fig. 5, from which a threshold value of 4.41 was determined. Pixels with a log-product value less than this threshold were considered shadows. By applying the NDVI in (1) to the NIR3 and red bands, vegetation pixels (enclosed by the yellow polygons) were identified with very high accuracy, as shown in Fig. 6. This removed approximately 30% of the non-building pixels from further analysis.

Fig. 3. Panchromatic images of two scenes at view 3 of Rio de Janeiro from the WorldView-2 sensor. (a) Scene 1 of size 525 × 525. (b) Scene 2 of size 500 × 500.

Fig. 5. Histogram of log product values of the red and yellow band intensities from image scene 1 at view 3.

2) Template Matching: The goal of this step is to screen out most of the non-building features before the subsequent classification process. The image at view 3 was used for extracting templates, and the pan-sharpened image at view 2 served as the second image for extracting frames. The second image includes an additional 200 rows and 50 columns towards the south and west, respectively, so that the displacement of large buildings from the first image can be detected. The view angle of the second image is farther from nadir in the northeastern direction than that of the first image; consequently, the buildings appear displaced in the southwest direction. Moreover, based on the height of the tallest building in the image, the largest possible displacement is nearly 180 rows and 45 columns. Thus, the frame extends 180 rows towards the south and 45 columns towards the west of the template's location at its northeast corner.

Visual inspection of Fig. 4(a) and (b) shows that planar objects such as roads and other non-elevated structures are located at the same coordinates in both images. For example, the red rectangular template shown in Fig. 4(a) and (b) is such a planar template. On the other hand, tall structures such as building roofs (yellow boxes in Fig. 4) are displaced considerably in the second image; the comparison then involves two non-planar templates at points (i, j) and (i′, j′), and the displacement of a non-planar template depends on the building height. Since the two images have different view angles, the apparent areas of the regions representing shadows and vegetation vary between the images.

Fig. 4. Pan-sharpened images using the Gram-Schmidt method. (a) Pan-sharpened image from view angle 3 (scene 1). (b) Pan-sharpened image from view angle 2 (scene 1).

Fig. 6. Pan-sharpened image at scene 1 of view 3 with shadows and vegetation marked in red and yellow outlines, respectively. Only outlines of the respective masks are depicted here for visual appeal; the detection itself is a binary segmentation, and the masks obtained are binary masks.

By applying the template matching method described in Section II.B, a relative height map was generated as in Fig. 7(a). In this process, a 21 × 21 window was used as the template block size; thus, the total number of templates in the first image was 625. The 21 × 21 template size was found to be optimal, because a smaller template may correlate with regions in the search area that are not the best matches, whereas a much larger template may mix planar and non-planar regions. From Fig. 7(a), we can clearly see that the regions representing buildings show high relative height values. However, a region corresponding to one building had several different height values even though the actual height is constant for a given building. This discrepancy can be attributed to a high degree of similarity between different sections of buildings. Most of the templates belonging to non-buildings have a very low height output, which is expected since they are planar objects. Based on a trial-and-error approach, the optimal threshold on the minimum correlation value for a match was found to be 0.6. In the histogram of the log-relative-height data in Fig. 7(b), the section shown in red represents the non-building regions in the image and the green section represents the regions that are mostly buildings. From this histogram, the minimum relative height value of 23 was obtained using the MBCV method, as discussed in Section II.B. The regions assigned to the non-building category account for 27.3% of the image. The percentage of image regions screened at the different processing steps is given in Table I.

Fig. 7. Template matching output: The template displacement calculated for each template in view 3 is assigned as the relative height for each 21 × 21 block. (a) Relative height of scene 1 from view angle 3. The color bar indicates the relative height value (template displacement). (b) Histogram of the relative height output.

Fig. 8. Output at the template matching stage. The image scene consisting of the remaining 42.7% of pixels is illustrated after extracting vegetation, shadows, and most of the planar templates.

TABLE I. PERCENTAGE OF IMAGE SCENES 1 AND 2 OF VIEW ANGLE 3 SCREENED AT EACH PROCESSING STEP; STEP 1 CORRESPONDS TO VEGETATION AND SHADOW DETECTION AND STEP 2 CORRESPONDS TO SCREENING MOST OF THE NON-BUILDINGS BY TEMPLATE MATCHING.
Hence, before the classification, a total of 57.3% of the image is pre-screened as non-buildings. A comparison of Fig. 8 with Fig. 4(a) shows that most of the building pixels are correctly detected through the height estimation method. However, a major issue with this method is false positives. One reason for these false detections is the occlusion of some regions by large buildings in the image (Fig. 4(b)). For instance, the region shown in the red box in Fig. 8 can be clearly seen in Fig. 4(a), but it is nearly occluded by the large neighboring building in Fig. 4(b). However, there are also some regions, such as the yellow box in Fig. 8, that are not occluded by any large structure but still falsely register as tall structures. This can be attributed to the correlation between different templates of road texture. Moreover, the roads in downtown Rio de Janeiro include flyovers, and the terrain itself has varied elevations in the study area.

C. Building Extraction Result

The image dataset was divided into 10 × 10 blocks. This size was determined based on the typical building size in the image; a smaller size would not capture enough variance, whereas a larger size would mix classes such as parking lots and roads. Hence, the total number of blocks obtained was 2704. The labels for buildings and non-buildings were obtained from reference samples derived via visual expert analysis, and these reference labels were used in training for each class. For a given block, the majority voting rule was adopted to assign a class label. Various statistical features were extracted for all the blocks. After the pre-screening, approximately 42% (1150/2704) of the feature vectors remained for the classification of buildings from non-building classes. In this experimental study, the training time was relatively low because of the masking in the previous stage.
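The per-block features described in Section II.C can be sketched as follows. This is an illustrative implementation: it uses a single-level Haar transform with one common averaging/differencing normalization, and entropy over normalized absolute coefficients, which may differ in detail from the paper's definitions.

```python
import numpy as np

def haar_dwt2(a):
    """One level of the 2-D Haar transform: approximation, horizontal,
    vertical, and diagonal coefficient sub-bands (cf. (9))."""
    a = a.astype(float)
    lo = (a[0::2] + a[1::2]) / 2.0        # low-pass along rows
    hi = (a[0::2] - a[1::2]) / 2.0        # high-pass along rows
    A = (lo[:, 0::2] + lo[:, 1::2]) / 2.0
    V = (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    H = (hi[:, 0::2] + hi[:, 1::2]) / 2.0
    D = (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return A, H, V, D

def block_features(block):
    """Feature vector for one block: the five statistical features plus
    wavelet-domain energy, entropy, max coefficient, and mean (cf. (10))."""
    x = block.astype(float).ravel()
    mu, var = x.mean(), x.var()
    sd = np.sqrt(var) + 1e-12
    skew = np.mean(((x - mu) / sd) ** 3)
    kurt = np.mean(((x - mu) / sd) ** 4)
    energy = np.sum(x * x)
    A, H, V, D = haar_dwt2(block)
    w = np.concatenate([c.ravel() for c in (A, H, V, D)])
    p = np.abs(w) / (np.sum(np.abs(w)) + 1e-12)
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return np.array([mu, var, skew, kurt, energy,
                     np.sum(w * w), entropy, np.abs(w).max(), w.mean()])
```

Applied to each 10 × 10 block and each band, this yields the 9-element vectors that are stacked into the feature matrix for the SVM.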
The SVM parameters, namely the complexity constant and the order of the polynomial kernel, were set as described in [42]. A 3-fold cross validation strategy was used to determine the classification performance [42]. Table II lists the resulting completeness and correctness factors obtained from different view combinations. These metrics were computed from the confusion matrix using the equations given in [5]: completeness is the number of correct detections as a fraction of the actual building samples, while correctness is the number of correct detections as a fraction of the total detections.

TABLE II. CONFUSION MATRICES FROM FUSION EXPERIMENTS AT SCENE 1, WITH CENTRAL VIEW 3 IN COMBINATION WITH DIFFERENT VIEWS.

The fusion of images from view 3 and view 2 gave a significant improvement in both completeness and correctness. The combinations of views 1, 4, and 5 with view 3 did not yield statistically significant improvements in completeness (although the correctness factors showed some improvement from these combinations). The major reason is that the image scene has very large buildings, which produce both large shadows and occlusions. Because of the satellite position with respect to the Sun, the shadows and occlusions are on opposite sides of the buildings in views 1, 4, and 5, losing some information related to planar regions (non-buildings). For view 2, however, occlusions have less influence on the template matching output, because this view is on the same side of nadir as view 3. Table III further gives the performance of fusing view 3 and view 2 in terms of the kappa accuracy (K), overall accuracy (OA), and error rate (ER). The OA is the ratio between the total correct detections and the total number of samples, and ER = 1 − OA. The reported results confirm that fusing view 2 with view 3 yielded improvements in all these metrics.
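The metrics above follow directly from the binary confusion matrix; a minimal sketch (with illustrative argument names, taking buildings as the positive class) is:

```python
def detection_metrics(tp, fp, fn, tn):
    """Completeness, correctness, overall accuracy, and Cohen's kappa
    from a binary (building / non-building) confusion matrix."""
    completeness = tp / (tp + fn)      # fraction of actual buildings found
    correctness = tp / (tp + fp)       # fraction of detections that are right
    total = tp + fp + fn + tn
    oa = (tp + tn) / total             # overall accuracy; error rate = 1 - oa
    # chance agreement from the row/column marginals, for kappa
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    return completeness, correctness, oa, kappa
```

For example, 80 true positives, 10 false positives, 20 false negatives, and 90 true negatives give a completeness of 0.80, a correctness of about 0.89, an OA of 0.85, and a kappa of 0.70.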
McNemar's test was adopted to verify the statistical significance of the improvement. The terms in Table IV are defined as follows: f11 is the number of samples for which both classifiers predict correctly, f12 is the number for which classifier 1 is correct and classifier 2 is wrong, f21 is the opposite case, and f22 is the number of samples for which both classifiers fail. The chi-square statistic can be easily computed from these values [45]. Here, the computed chi-square statistic exceeds 3.84, the critical chi-square value for one degree of freedom at the 0.05 level. Hence the improvement is significant at the 95% confidence level.

TABLE III: CLASSIFICATION PERFORMANCE METRICS WHEN FUSING VIEW 3 AND VIEW 2

TABLE IV: STATISTICS FROM MCNEMAR'S TEST ON SCENE 1 (COMBINATION OF VIEW 3 AND VIEW 2 VERSUS ONLY VIEW 3): PERFORMANCE COMPARISON BASED ON CORRECT PROPORTIONS

Fig. 9. Output of the hybrid building extraction algorithm: Classification maps are shown with building polygon outlines for better visual appeal. These polygon outlines indicate that all the pixels within the polygons are classified as buildings. (a) Original building classification map for scene 1 of view 3. (b) Refined building classification map for scene 1 of view 3. (c) Refined building classification map for scene 2 of view 3.

Fig. 10. An approximate 3D model of buildings computed from the relative height data (displacement) for scene 1 of view 3.

Fig. 9 demonstrates the resulting building maps. The red polygons in Fig. 9(a) are the boundaries of the classification outputs, with every pixel inside a boundary classified as a building. This rough map included many false positives, such as the small red polygons on roads, as well as false negatives within the buildings. Most of these false alarms were reduced based on the minimum building area threshold criterion discussed in Section II.C.
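The McNemar statistic itself depends only on the discordant counts f12 and f21. A sketch, written without the optional continuity correction that some formulations apply:

```python
def mcnemar_chi2(f12, f21):
    """McNemar's chi-square statistic (1 degree of freedom) from the
    discordant counts: f12 = classifier 1 correct, classifier 2 wrong;
    f21 = the opposite case [45]. The concordant counts f11 and f22 do
    not enter the statistic. The improvement is significant at the 95%
    level when the statistic exceeds 3.84."""
    return (f12 - f21) ** 2 / (f12 + f21)
```

For instance, with f12 = 30 and f21 = 10 the statistic is 10.0 > 3.84, i.e., significant at the 95% confidence level.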
The refined building maps for the two scenes are presented in Fig. 9(b) and (c), respectively. A preliminary analysis of feature importance for the classification process reveals that the statistical features and the energy of the wavelet approximation sequence are the most significant features. Features such as wavelet-based entropy and the maximum coefficient and statistics of the decomposition sequences are less relevant. In particular, the combined use of the statistical features and all four types of wavelet features reduces the classification performance. This behavior can be attributed to the similarity in the textural properties of the building and non-building areas. A related study similarly found texture features to be of low relevance for the classification of similar optical imagery [46]. A better visualization of the building classification output is also presented in Fig. 10, using the relative building height derived from the template matching. This approximate 3D building model is created using shaded surface plots. It provides a more appealing visualization than the previous height output in Fig. 7(a), where the height within a building was inconsistent and the shapes of the buildings were not well preserved. Averaging the heights within the building segments (shown in red in Fig. 9(b)) from the classification output yielded smooth heights and shapes of complete buildings (Fig. 10). This could also be used to illustrate the relative height variations of tall buildings in the scene. The tallest building in the 3D building model (shown in sky blue) corresponds to the actual tallest building (based on visual interpretation) in the scene, followed by the second tallest building (in green), with smaller buildings shown in other colors (e.g., red).
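The segment-height averaging that produces the smooth 3D model can be sketched as follows. This is our illustration using a simple 4-connected flood fill over the building mask, not the authors' implementation; the function name is hypothetical.

```python
import numpy as np
from collections import deque

def smooth_segment_heights(height_map, building_mask):
    """Replace the per-pixel relative heights inside each connected building
    segment with the segment's mean height, so every building gets a single
    consistent height for the shaded-surface 3D model."""
    h, w = building_mask.shape
    seen = np.zeros((h, w), dtype=bool)
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            if building_mask[i, j] and not seen[i, j]:
                # collect one 4-connected building segment via BFS
                seg, queue = [], deque([(i, j)])
                seen[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    seg.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and building_mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                # assign the segment mean to every pixel in the segment
                mean_h = sum(height_map[p] for p in seg) / len(seg)
                for p in seg:
                    out[p] = mean_h
    return out
```

The flattened per-segment heights can then be rendered as a shaded surface plot to obtain a figure comparable to Fig. 10.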
IV. SUMMARY AND FUTURE WORK

In this study, a hybrid building extraction methodology is developed and applied to multi-angular imagery from the WorldView-2 sensor. Initially, the Gram-Schmidt orthogonalization-based pan-sharpening method is used to fuse the panchromatic and multispectral imagery. Then, vegetation and shadow regions are detected and masked with high accuracy. Next, a 2D Fourier transform-based template matching is used to estimate the relative height model of the image scene. In the final stage, features from the pan-sharpened data, along with this height data, are exploited for building extraction. A relative 3D building model is generated from the relative height data and the building classification map. Both the completeness and correctness factors for the two subsets of imagery are above 88%. The performance improvement due to the incorporation of the height data is statistically significant (at the 95% confidence level, by McNemar's test) over the baseline performance of 83% with an SVM-based binary classification. Some of the key parameters of the hybrid approach include: 1) the threshold for separating shadow pixels from non-shadow pixels, whose value may depend on the type of pan-sharpening method used; 2) the threshold for vegetation detection from NDVI data, which has to be decided from the histogram of the NDVI values; 3) the block size in the template matching, whose selection influences the efficient omission of non-building blocks without losing any building blocks; and 4) the minimum height threshold, which separates buildings from non-buildings. We will study how to automatically optimize these parameters in future work. Other future work may concern the following issues. In this study, only three combinations of the four multi-angular images were used. It might be possible to improve the performance of the height estimation method by fusing the results from other combinations; in particular, false height estimates due to occlusions could be reduced.
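As an illustration of the vegetation-detection parameter listed above, NDVI-based masking can be sketched as follows. The 0.3 default threshold here is a common rule of thumb, not the paper's value; the authors derive theirs from the histogram of the NDVI values.

```python
import numpy as np

def ndvi_vegetation_mask(nir, red, threshold=0.3):
    """Vegetation mask from the Normalized Difference Vegetation Index,
    NDVI = (NIR - Red) / (NIR + Red). Pixels with NDVI above the chosen
    threshold are flagged as vegetation and excluded before building
    extraction. The small epsilon guards against division by zero."""
    ndvi = (nir - red) / (nir + red + 1e-9)
    return ndvi > threshold
```

In the paper's pipeline a mask like this, together with the shadow mask, pre-screens a large fraction of the scene as non-building before template matching and classification.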
It is also possible to use segmentation algorithms, such as the watershed transformation, to reduce false positives by identifying the boundaries of the buildings. By using the metadata for the multi-angular imagery, such as the cross-track and off-nadir view angles [44], we will be able to estimate true building heights. Another issue with this method is that the edges of the detected building shapes are not as smooth as the original shapes. By reducing the pixel block size in the feature extraction process, the shape can be improved; however, this may incur additional false alarms. Another possible direction for shape improvement is the application of active contour models (ACM) or curve evolution techniques. We will also explore state-of-the-art template matching techniques [47], [48] for reducing the false positives in the screened image.

ACKNOWLEDGMENT

The multi-angular images were provided by DigitalGlobe through the 2011 Data Fusion Contest, organized by the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society.

REFERENCES

[1] D. Tuia and G. Camps-Valls, "Urban image classification with semi-supervised multi-scale cluster kernels," IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, no. 1, pp. 65–74, Mar. 2011.
[2] M. Izadi and P. Saeedi, "Automatic building detection in aerial images using a hierarchical feature based image segmentation," in Proc. 20th Int. Conf. Pattern Recognition, 2010, pp. 472–475.
[3] L. B. Theng, "Automatic building extraction from satellite imagery," Eng. Lett., vol. 13, no. 5, Nov. 2006.
[4] U. Weidner and W. Forstner, "Towards automatic building reconstruction from high resolution digital elevation models," ISPRS J. Photogramm. Remote Sens., vol. 50, no. 4, pp. 38–49, 1995.
[5] S. Ahmadi, M. J. V. Zoej, H. Ebadi, H. A. Moghaddam, and A. Mohammadzadeh, "Automatic urban building boundary extraction from high resolution aerial images using an innovative model of active contours," Int.
J. Appl. Earth Obs., vol. 12, pp. 150–157, Feb. 2010.
[6] M. Kabolizade, H. Ebadi, and S. Ahmadi, "An improved snake model for automatic extraction of buildings from urban aerial images and LiDAR data," Computers, Environment and Urban Systems, vol. 34, pp. 435–441, Apr. 2010.
[7] M. Ortner, X. Descombes, and J. Zerubia, "Building outline extraction from digital elevation models using marked point processes," Int. J. Comput. Vision, vol. 72, no. 2, pp. 107–132, 2007.
[8] O. Tournaire, M. Bredif, D. Boldo, and M. Durupt, "An efficient stochastic approach for building footprint extraction from digital elevation models," ISPRS J. Photogramm., vol. 65, pp. 317–327, Mar. 2010.
[9] M. Awrangjeb, M. Ravanbakhsh, and C. S. Fraser, "Automatic detection of residential buildings using LIDAR data and multispectral imagery," ISPRS J. Photogramm., vol. 65, pp. 457–467, Jul. 2010.
[10] U. Soergel, E. Michaelsen, A. Thiele, E. Cadario, and U. Thoennessen, "Stereo analysis of high-resolution SAR images for building height estimation in cases of orthogonal aspect directions," ISPRS J. Photogramm., vol. 64, pp. 490–500, Feb. 2009.
[11] E. Michaelsen, U. Stilla, U. Soergel, and L. Doktorski, "Extraction of building polygons from SAR images: Grouping and decision-level in the GESTALT system," Pattern Recogn. Lett., vol. 31, pp. 1071–1076, Oct. 2009.
[12] D. Brunner, G. Lemoine, L. Bruzzone, and H. Greidanus, "Building height retrieval from VHR SAR imagery based on an iterative simulation and matching technique," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1487–1504, Mar. 2010.
[13] R. Guida, A. Iodice, and D. Riccio, "Height retrieval of isolated buildings from single high-resolution SAR images," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2967–2979, Jul. 2010.
[14] A. Alobeid, K. Jacobsen, and C. Heipke, "Building height estimation in urban areas from very high resolution satellite stereo images," Proc. ISPRS, vol. 38, 2009.
[15] Z. Lari and H.
Ebadi, "Automated building extraction from high-resolution satellite imagery using spectral and structural information based on artificial neural networks," University of Hannover, Germany, 2011 [Online]. Available: www.ipi.uni-hannover.de/fileadmin/institut/pdf/Lari_Ebadi.pdf
[16] H. T. Shandiz, S. M. Mirhassani, and B. Yousefi, "Hierarchical method for building extraction in urban area images using unsharp masking (USM) and Bayesian classifier," in Proc. 15th Int. Conf. Systems, Signals and Image Processing, 2008, pp. 193–196.
[17] D. K. San and M. Turker, "Building extraction from high resolution satellite images using Hough transform," in Int. Archives Photogrammetry, Remote Sensing and Spatial Information Science, Kyoto, Japan, 2010, vol. XXXVIII, Part 8.
[18] P. M. Dare, "Shadow analysis in high-resolution satellite imagery of urban areas," Photogramm. Eng. Remote Sens., vol. 71, no. 2, pp. 169–177, 2005.
[19] J.-Y. Rau, N.-Y. Chen, and L.-C. Chen, "True orthophoto generation of built-up areas using multi-view images," Photogramm. Eng. Remote Sens., vol. 68, no. 6, pp. 581–588, 2002.
[20] T. F. Furtuna, "Dynamic programming algorithms in speech recognition," Informatica Economica, vol. 2, no. 46, pp. 94–99, 2008.
[21] T. Krauss, P. Reinartz, M. Lehner, M. Schroeder, and U. Stilla, "DEM generation from very high resolution stereo satellite data in urban areas using dynamic programming," in Proc. ISPRS Hannover Workshop on High-Resolution Earth Imaging for Geospatial Information, Hannover, Germany, May 2005, pp. 17–20.
[22] C. A. Laben and B. V. Brower, "Process for enhancing the spatial resolution of multispectral imagery using pan-sharpening," U.S. Patent 6,011,875, Jan. 4, 2000.
[23] O. Demirkaya, M. H. Asyali, and P. K. Sahoo, Image Processing with MATLAB, Applications in Medicine and Biology. London, U.K.: CRC Press, 2009.
[24] T.
Kurita, N. Otsu, and N. Abdelmalek, "Maximum likelihood thresholding based on population mixture models," Pattern Recognit., vol. 25, pp. 1231–1240, 1992.
[25] A. M. Polidorio, F. C. Flores, C. Franco, N. N. Imai, and A. M. G. Tommaselli, "Enhancement of terrestrial surface features on high spatial resolution multispectral aerial images," in Proc. 23rd Conf. Graphics, Patterns and Images, Aug. 2010, pp. 295–300.
[26] Z. Zhang and F. Chen, "A shadow processing method of high spatial resolution remote sensing image," in Proc. 3rd Int. Congr. Image and Signal Processing, Oct. 2010, vol. 2, pp. 816–820.
[27] D. Cai, M. Li, Z. Bao, Z. Chen, W. Wei, and H. Zhang, "Study on shadow detection method on high resolution remote sensing image based on HIS space transformation and NDVI index," in Proc. 18th Int. Conf. Geoinformatics, Jun. 2010, pp. 1–4.
[28] F. Lamberti, A. Sanna, and G. Paravati, "Improving robustness of infrared target tracking algorithms based on template matching," IEEE Trans. Aerosp. Electron. Syst., vol. 47, no. 2, pp. 1467–1480, Apr. 2011.
[29] G. Zhu, S. Zhang, X. Chen, and C. Wang, "Efficient illumination insensitive object tracking by normalized gradient matching," IEEE Signal Process. Lett., vol. 14, no. 12, pp. 944–947, Dec. 2007.
[30] Z. Lin and L. S. Davis, "Shape-based human detection and segmentation via hierarchical part-template matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 604–618, Apr. 2010.
[31] H. Hashiba, K. Kameda, S. Tanaka, and T. Sugimura, "Digital roof model (DRM) using high resolution satellite image and its application for 3D mapping of city region," in Proc. IEEE Int. Geoscience and Remote Sensing Symp., Jul. 2003, vol. 3, pp. 1990–1992.
[32] H. V. Sorensen, C. S. Burrus, and M. T. Heideman, Fast Fourier Transform Database. Boston, MA: PWS Publishing, 1995.
[33] K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," in Proc. SPIE, vol. 4387, Orlando, FL, Apr. 2001, p. 95.
[34] D. Aboutajdine and F. Essannouni, "Fast block matching algorithms using frequency domain," in Proc. Int. Conf. Multimedia Computing and Systems, Jul. 2011, pp. 1–6.
[35] D. J. Kroon, "Fast/Robust Template Matching," 2011 [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/24925-fastrobust-template-matching
[36] X. Meng, L. Wang, and N. Currit, "Morphology based building detection from airborne lidar data," Photogramm. Eng. Remote Sens., vol. 75, no. 4, pp. 437–442, Apr. 2009.
[37] X. Jin and C. H. Davis, "Automated building extraction from high-resolution satellite imagery in urban areas using structural, contextual, and spectral information," EURASIP J. Appl. Signal Process., vol. 14, pp. 2196–2206, 2005.
[38] J. R. Jensen, "Thematic information extraction: Pattern recognition," in Introductory Digital Image Processing: A Remote Sensing Perspective. New York: Prentice Hall, 2005.
[39] C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[40] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[41] C. Cortes and V. Vapnik, "Support-vector network," Mach. Learn., vol. 20, pp. 273–297, 1995.
[42] B. Gokaraju, S. S. Durbha, R. L. King, and N. H. Younan, "A machine learning based spatio-temporal data mining approach for detection of harmful algal blooms in the Gulf of Mexico," IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, no. 3, pp. 710–720, 2011.
[43] U. H.-G. Kressel, "Pairwise classification and support vector machines," in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 255–268.
[44] J. K. N.
Da Costa, "Geometric Quality Testing of the WorldView-2 Image Data Acquired Over the JRC Maussane Test Site using ERDAS LPS, PCI Geomatics and Keystone Digital Photogrammetry Software Packages – Initial Findings," Publications Office of the European Union, 2010 [Online]. Available: http://publications.jrc.ec.europa.eu/repository/handle/111111111/15036
[45] G. M. Foody, "Thematic map comparison: Evaluating the statistical significance of differences in classification accuracy," Photogramm. Eng. Remote Sens., vol. 70, no. 5, pp. 627–633, May 2004.
[46] N. Longbotham, C. Bleiler, C. Chaapel, C. Padwick, W. Emery, and F. Pacifici, "Spatial classification of WorldView-2 multi-angle sequence," in Proc. Joint Urban Remote Sensing Event, Munich, Germany, Apr. 2011, pp. 105–108.
[47] H. Schweitzer, R. Deng, and R. F. Anderson, "A dual-bound algorithm for very fast and exact template matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 459–470, Mar. 2011.
[48] F. Essannouni and D. Aboutajdine, "Fast frequency template matching using higher order statistics," IEEE Trans. Image Process., vol. 19, no. 3, pp. 826–830, Mar. 2010.

Anish Turlapaty received the B.Tech. degree in electronics and communication engineering from Nagarjuna University, India, in 2002, the M.Sc. degree in radio astronomy and space sciences from Chalmers University, Sweden, in 2006, and the Ph.D. degree in electrical engineering from Mississippi State University, Mississippi State, MS, in 2010. He is a Postdoctoral Associate in the Electrical and Computer Engineering Department of Mississippi State University. He is currently developing pattern recognition and digital signal processing based methods for the detection of sub-surface radioactive waste using electromagnetic induction spectroscopy and gamma ray spectrometry.
He received graduate student of the year awards for outstanding research for 2010 from both the Mississippi State University Centers and Institutes and the Geosystems Research Institute at Mississippi State University. He has published three peer-reviewed journal papers and presented eight papers and posters at international conferences and symposia. He is also an active reviewer for many international journals related to signal processing.

Balakrishna Gokaraju received the B.Tech. degree in electronics and communication engineering from J.N.T.U. Hyderabad, India, in 2003, and the M.Tech. degree in remote sensing from the Birla Institute of Technology, Ranchi, India, in 2006. He received the Ph.D. degree in electrical and computer engineering from Mississippi State University, Mississippi State, MS, in 2010. He is currently a Postdoctoral Associate in the Geosystems Research Institute at Mississippi State University's High Performance Computing Collaboratory. His research interests include, but are not limited to, pattern recognition, machine learning, spatio-temporal data mining, and adaptive signal processing. His research experience includes the development of various algorithms for automatic target detection, classification, and prediction in multispectral, hyperspectral, and radar remote sensing applications. He has developed various object-oriented approaches that exploit very high-resolution satellite sensor data for their potential in urban remote sensing studies, and has also investigated high performance computing using hybrid CPU-GPU architectures for scientific applications. He has published in many peer-reviewed journals and international conference proceedings. Dr.
Gokaraju is an active reviewer for various journals and conferences, including the Journal of Applied Geography (Elsevier), the International Journal of Image and Data Fusion (Taylor and Francis), and the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS) (IEEE Geoscience and Remote Sensing Society).

Qian Du (S'98–M'00–SM'05) received the Ph.D. degree in electrical engineering from the University of Maryland Baltimore County in 2000. She was with the Department of Electrical Engineering and Computer Science, Texas A&M University, Kingsville, from 2000 to 2004. She joined the Department of Electrical and Computer Engineering at Mississippi State University in Fall 2004, where she is currently an Associate Professor. Her research interests include hyperspectral remote sensing image analysis, pattern classification, data compression, and neural networks. Dr. Du currently serves as Co-Chair of the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society. She also serves as an Associate Editor for the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS). She received the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society. She is the General Chair of the 4th IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). She is a member of SPIE, ASPRS, and ASEE.

Nicolas H. Younan (SM'99) received the B.S. and M.S. degrees from Mississippi State University in 1982 and 1984, respectively, and the Ph.D. degree from Ohio University in 1988. He is currently the Department Head and James Worth Bagley Chair of Electrical and Computer Engineering at Mississippi State University. His research interests include signal processing and pattern recognition.
He has been involved in the development of advanced image processing and pattern recognition algorithms for remote sensing applications, image/data fusion, feature extraction and classification, automatic target recognition/identification, and image information mining. Dr. Younan has published over 150 papers in refereed journals, conference proceedings, and book chapters. He has served as General Chair and Editor for the 4th IASTED International Conference on Signal and Image Processing, Co-Editor for the 3rd International Workshop on the Analysis of Multi-Temporal Remote Sensing Images, Guest Editor for Pattern Recognition Letters, and Associate Editor for JSTARS. He is a senior member of the IEEE and a member of the IEEE Geoscience and Remote Sensing Society, serving on two technical committees: Data Fusion, and Data Archive and Distribution.

James V. Aanstoos (SM'99) received the B.S. and M.E.E. degrees in electrical and computer engineering from Rice University, Houston, TX, in 1977 and 1979, respectively, and the Ph.D. degree in atmospheric science from Purdue University, West Lafayette, IN, in 1996. He is currently an Associate Research Professor at the Geosystems Research Institute (GRI) at Mississippi State University and an adjunct faculty member in the Electrical and Computer Engineering Department. His current responsibilities include conducting and managing research on remotely sensed image and radar analysis, and teaching in the area of remote sensing. Current research activities include the application of synthetic aperture radar and multispectral imagery to levee vulnerability assessment. Other recent research activities have focused on multispectral land cover classification and accuracy assessment, terrain database visualization, and the development of a multispectral sensor simulator. He has published in several peer-reviewed journals and conference proceedings. Dr.
Aanstoos has been an active member of professional societies including the IEEE, Sigma Xi, and the American Geophysical Union. He served as an Executive Committee member and Treasurer of the Applied Imagery Pattern Recognition (AIPR) workshop from 1995 to 1999, General Chair of AIPR 2001–2002, Program Chair of AIPR 2000, and Secretary from 2003 to 2011. He served on the technical advisory committee for the North Carolina statewide land cover mapping project in 1995.