1. Introduction

Object detection, the task of locating particular objects in an image, is one of the central problems in computer vision, and many methods for it have been proposed. Most recent object detection methods classify objects by applying statistical training to the feature weights of local areas extracted from several thousand training samples. Low-level features such as Haar-like features \cite{Haar-like}, Edge Orientation Histograms (EOH) \cite{EOH}, and Edgelets \cite{Egdelet} have been used as the local features of such methods. Among them, the Histograms of Oriented Gradients (HOG) \cite{HOG} have been demonstrated to be robust to changes in illumination and to geometric transformations. With these low-level features alone, however, recognition capability is limited. Methods have therefore been proposed in recent years that achieve highly accurate object detection by generating mid-level features, using statistical learning algorithms such as AdaBoost \cite{AdaBoost} to combine low-level features on the basis of the relations between feature weights (relatedness). Joint Haar-like features \cite{jointHaar-like}, for example, are represented as combinations of binary symbols computed from Haar-like features, and detection is performed on the basis of the probabilities of their co-occurrence. This method is faster and more accurate than conventional face detection, but because its feature weights depend on luminance values, it is not well suited to humans and other objects with diverse shapes and textures.

We therefore propose an object detection method that uses joint HOG features as mid-level features that automatically capture the symmetry and continuity of object shape. The joint HOG features are obtained by combining the HOG features of two different areas by means of a first-stage Real AdaBoost \cite{Real AdaBoost}. This automatically generates joint HOG features that represent symmetry and continuity, which are difficult to represent with a single feature. The generated joint HOG features are then input to a second-stage Real AdaBoost, which constructs the final classifier used to detect the object. In this paper, we demonstrate the effectiveness of the proposed method experimentally, taking as detection targets the non-rigid human body and automobiles, whose appearance differs greatly with the point of view.

2. Joint HOG features

The flow of the proposed method is illustrated in Fig. \ref{fig1: flow}. The method uses two stages of Real AdaBoost to construct the final classifier. First, the first-round Real AdaBoost generates a pool of joint HOG features, each of which combines two low-level HOG features computed at different locations. This allows multiple HOG features to be observed at the same time, making it possible to automatically generate joint HOG features that represent symmetry and edge continuity, qualities that cannot be captured by a conventional single HOG feature alone. Next, the second-round Real AdaBoost selects from the joint HOG feature pool the features that are most effective for discrimination, and the resulting final classifier performs the detection.

2.1 Low-level features: HOG features

As the low-level feature, we use the Histograms of Oriented Gradients (HOG) proposed by Dalal et al. \cite{HOG}. HOG features are histograms of the gradient orientations in local areas called cells (Fig. \ref{fig2: region}(b)). They capture the shape of an object, are unaffected by changes in illumination, and are robust to local changes in geometry. The HOG features are calculated as follows. From the luminance $L$ of each pixel, the gradient magnitude $m$ and orientation $\theta$ are computed as
\[
m(x, y) = \sqrt{f_{x}(x, y)^{2} + f_{y}(x, y)^{2}}, \qquad
\theta(x, y) = \tan^{-1}\frac{f_{y}(x, y)}{f_{x}(x, y)},
\]
where $f_{x}(x, y) = L(x+1, y) - L(x-1, y)$ and $f_{y}(x, y) = L(x, y+1) - L(x, y-1)$. The luminance gradient orientation histogram of each cell is then generated from the calculated gradient magnitude $m$ and orientation $\theta$, with the obtained orientations divided into 20-degree bins. Finally, the feature weights are normalized over each block area (Fig. \ref{fig2: region}(c)):
\[
v' = \frac{v}{\sqrt{\left(\sum_{i=1}^{k} v_{i}^{2}\right) + \epsilon}},
\]
where $v$ is a HOG feature, $k$ is the number of HOG features in the block, and $\epsilon$ is a coefficient that prevents division by zero.
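As a concrete illustration of this procedure, the following minimal Python/NumPy sketch computes the per-cell orientation histograms (block normalization is omitted). It is not the implementation used in this work: the $5\times 5$-pixel cell size is an assumption, and the bin settings follow the parameters described later (20-degree bins over 0--180 degrees, i.e., 9 orientations for humans, or over 0--360 degrees, i.e., 18 orientations for automobiles).

\begin{verbatim}
import numpy as np

def hog_cell_histograms(L, cell=5, n_bins=9, unsigned=True):
    # Centered differences give the luminance gradient.
    fx = np.zeros_like(L, dtype=float)
    fy = np.zeros_like(L, dtype=float)
    fx[:, 1:-1] = L[:, 2:] - L[:, :-2]
    fy[1:-1, :] = L[2:, :] - L[:-2, :]
    m = np.hypot(fx, fy)                    # gradient magnitude
    theta = np.degrees(np.arctan2(fy, fx))  # gradient orientation
    full = 180.0 if unsigned else 360.0     # fold to 0-180 deg for humans
    theta = theta % full
    bin_w = full / n_bins                   # 20-degree bins
    cy, cx = L.shape[0] // cell, L.shape[1] // cell
    hist = np.zeros((cy, cx, n_bins))
    for y in range(cy * cell):
        for x in range(cx * cell):
            b = min(int(theta[y, x] // bin_w), n_bins - 1)
            hist[y // cell, x // cell, b] += m[y, x]  # magnitude vote
    return hist
\end{verbatim}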
2.2 HOG feature co-occurrence

To generate the joint HOG features, we represent the co-occurrence of multiple HOG features \cite{jointHaar-like}. Representing co-occurrence makes it possible to observe two features at the same time. First, we calculate a binary symbol $s$ that separates detection targets from non-targets:
\[
s = \begin{cases}
1 & \text{if } p = 1 \text{ and } v \ge \theta, \text{ or } p = 0 \text{ and } v < \theta, \\
0 & \text{otherwise,}
\end{cases}
\]
where $\theta$ is a threshold, $p \in \{0, 1\}$ is a code that determines the direction of the inequality, and $v$ is an element of ${\bf V}=[v_{1}, v_{2}, \cdots, v_{o}]$, the HOG feature weights calculated from one cell, $o$ being the number of gradient orientations. By combining two binary symbols obtained in this way, we get a feature $j$ that represents their co-occurrence \cite{jointHaar-like}. For example, when the HOG feature binary symbols $s_{1}=1$ and $s_{2}=1$ are observed in an input image as in Fig. \ref{fig: Co}, the co-occurrence feature is $j=(11)_{2}=3$. This $j$ is an index number for the binary representation of the combined symbols; because we combine two features, $j$ takes one of four values.
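The binary symbols and their combination can be written compactly. The sketch below assumes the convention for $p$ described above and reproduces the example of Fig. \ref{fig: Co}.

\begin{verbatim}
def binary_symbol(v, theta, p):
    # p in {0, 1} selects the direction of the inequality against
    # the threshold theta (the convention assumed above).
    return int(v >= theta) if p == 1 else int(v < theta)

def cooccurrence_index(s1, s2):
    # j = (s1 s2)_2, one of the four values {0, 1, 2, 3}.
    return (s1 << 1) | s2

# The example from the text: s1 = 1 and s2 = 1 gives j = (11)_2 = 3.
assert cooccurrence_index(1, 1) == 3
\end{verbatim}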
2.3 Mid-level features: joint HOG features

The co-occurrence features calculated in section \ref{Co-occurrenceFeatures} are used to generate joint HOG features in the first-round Real AdaBoost (Fig. \ref{fig: jointHOG}). This captures the relations between cells, namely the symmetry of object shape and edge continuity. From the features that represent co-occurrence for cells at two different locations, $c_{m}$ and $c_{n}$, Real AdaBoost selects those that are effective for discrimination. The function that observes HOG feature co-occurrence in an input image $x$ is written $J_{t}(x)$. When the feature $J_{t}(x)=j$ is observed in input image $x$, the weak classifier $h_{t}(x)$ of the first-round Real AdaBoost is
\[
h_{t}(x) = \frac{1}{2} \ln \frac{P_{t}(y=+1 \mid j) + \epsilon}{P_{t}(y=-1 \mid j) + \epsilon},
\]
where $t$ is the training round and $\epsilon$ is a coefficient that prevents division by zero; we determined $\epsilon=0.0000001$ by experiment. The term $y \in \{+1,-1\}$ is the correct label, and $P_{t}(y=+1\mid j)$ and $P_{t}(y=-1\mid j)$ are the conditional probability distributions for when a feature $j$ representing HOG feature co-occurrence is observed. They are calculated from the weights $D_{t}(i)$ of the training samples $i$:
\[
P_{t}(y=+1 \mid j) = \sum_{i: y_{i}=+1 \wedge J_{t}(x_{i})=j} D_{t}(i), \qquad
P_{t}(y=-1 \mid j) = \sum_{i: y_{i}=-1 \wedge J_{t}(x_{i})=j} D_{t}(i).
\]
The conditional probability distributions are represented by one-dimensional histograms: the co-occurrence features are calculated from the training samples $i$, and the training sample weights $D_{t}(i)$ are added to the corresponding histogram bin $j$. Because the bin numbers $j$ correspond to the index numbers of the co-occurrence features, the first-round Real AdaBoost uses four bins. Next, we use the conditional probability distributions to obtain an evaluation value $z$ that represents the separation of the two distributions:
\[
z = 2 \sum_{j} \sqrt{P_{t}(y=+1 \mid j)\, P_{t}(y=-1 \mid j)}.
\]
Smaller values of $z$ indicate greater separation of the positive-class and negative-class distributions. Among the many co-occurrence features $j$ defined by different cell pairs and thresholds, the feature with the smallest $z$ is selected as the weak classifier. Finally, the joint HOG feature $H^{c_{m}, c_{n}}(x)$, the strong classifier of the first-round Real AdaBoost, is constructed as
\[
H^{c_{m}, c_{n}}(x) = \mathrm{sign}\left( \sum_{t=1}^{T} h_{t}(x) \right).
\]
This processing is applied to all combinations of cells, generating as many joint HOG features as there are cell combinations. All of the generated joint HOG features are put into a single pool that is input to the second-round Real AdaBoost, which constructs the final classifier as described below.
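The first-round training loop can be sketched as follows, assuming every candidate feature has already been reduced to its co-occurrence index $j$ for each training sample. The array layout and the sign-of-sum form of the joint HOG feature are assumptions consistent with the equations above, not the authors' implementation.

\begin{verbatim}
import numpy as np

EPS = 1e-7  # the epsilon above, preventing division by zero

def train_joint_hog(J, y, T):
    # J: (n_candidates, n_samples) co-occurrence indices in {0,1,2,3};
    # y: labels in {+1, -1}. Returns the T selected weak classifiers.
    n = J.shape[1]
    D = np.full(n, 1.0 / n)                   # sample weights D_t(i)
    weak = []
    for _ in range(T):
        best = None
        for c in range(J.shape[0]):
            Wp = np.bincount(J[c][y == +1], weights=D[y == +1], minlength=4)
            Wm = np.bincount(J[c][y == -1], weights=D[y == -1], minlength=4)
            z = 2.0 * np.sqrt(Wp * Wm).sum()  # separation criterion z
            if best is None or z < best[0]:
                best = (z, c, 0.5 * np.log((Wp + EPS) / (Wm + EPS)))
        z, c, h = best
        weak.append((c, h))                   # h maps bin j -> h_t(x)
        D *= np.exp(-y * h[J[c]])             # Real AdaBoost reweighting
        D /= D.sum()
    return weak

def joint_hog_feature(weak, J_x):
    # H(x) = sign(sum_t h_t(x)) for one sample; J_x[c] = observed j.
    return np.sign(sum(h[J_x[c]] for c, h in weak))
\end{verbatim}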
2.4 Constructing the final classifier with the second-round Real AdaBoost

The second-stage Real AdaBoost takes the joint HOG feature pool generated in the first stage as input and constructs the final classifier. In this way, the joint HOG features that are effective for discrimination can be selected automatically. The final strong classifier $G(x)$ obtained by the second-stage Real AdaBoost is
\[
G(x) = \mathrm{sign}\left( \sum_{c} H^{c}(x) - \lambda \right),
\]
where $\lambda$ is the detector threshold and $c$ is a combination of cells selected from the joint HOG feature pool. The second round thus constructs the detector from only those features in the pool that are effective for discrimination.

3. Discrimination experiments using joint HOG features

To demonstrate the effectiveness of the proposed method, we conducted experiments on the detection of humans and automobiles.

3.1 Databases

Part of the database is shown in Fig. \ref{fig: database}. The human image database is the one used in \cite{AandS}; its images were photographed at multiple locations. As in \cite{AandS}, the training database contains 2,054 positive samples and 6,258 negative samples, and the evaluation database contains 1,000 positive samples and 1,235 negative samples. The positive samples in the automobile database are automobile areas cut out from video taken by a rear-facing vehicle-mounted camera; the negative samples are random areas of the background. The automobile training database comprises 2,464 positive samples and 2,415 negative samples; the evaluation database comprises 1,899 positive samples and 2,413 negative samples.

3.2 Overview of the experiments

We used the evaluation databases to conduct the human and automobile discrimination experiments. The parameters are listed in Table 1. The gradient orientations in Table 1 range from 0 to 360 degrees for automobiles and from 0 to 180 degrees for humans. The orientations for humans are folded to 180 degrees so that clothing does not affect the results, because the luminance relation between clothing and background is sometimes inverted. The baseline for comparison is HOG features + AdaBoost (HOG) \cite{HOG}. For evaluation we use the Detection Error Tradeoff (DET) curve, a log-log plot with the false positive rate on the horizontal axis and the miss rate on the vertical axis; values closer to the origin indicate better performance.
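For reference, DET points of the kind plotted here can be computed from classifier scores as in the following sketch (the score arrays are hypothetical, and plotting is omitted).

\begin{verbatim}
import numpy as np

def det_points(scores_pos, scores_neg):
    # Sweep a decision threshold over all observed scores and record
    # (false positive rate, miss rate) pairs for a log-log DET plot.
    thresholds = np.unique(np.concatenate([scores_pos, scores_neg]))
    pts = []
    for th in thresholds:
        fpr = np.mean(scores_neg >= th)   # negatives wrongly accepted
        miss = np.mean(scores_pos < th)   # positives wrongly rejected
        pts.append((fpr, miss))
    return pts
\end{verbatim}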
3.3 Experimental results

The discrimination results are presented in Fig. \ref{fig: DET}(a) for humans and in Fig. \ref{fig: DET}(b) for automobiles. The proposed method discriminates more accurately than the conventional method using HOG features. At a false positive rate of 5.0\% for humans, the improvement in discrimination performance was about 24.6\%; at a false positive rate of 2.0\% for automobiles, it was about 7.0\%. These improvements come from the automatic selection of new effective features that combine the HOG features of cells at two different locations, which allows discrimination of patterns that are difficult to discriminate with HOG features alone.

3.4 Discussion

Here we discuss the joint HOG features selected by Real AdaBoost in the experiments on discriminating humans and automobiles. The selected HOG features are visualized in Fig. \ref{fig: visual}(a) and (d) for the first-round Real AdaBoost and in Fig. \ref{fig: visual}(b) and (e) for the second-round Real AdaBoost. Panels (c) and (f) of the same figure show, for each training round, the two selected cells and the joint HOG features of the second-round Real AdaBoost. The HOG feature gradient orientations are represented by 9 directions for humans and 18 directions for automobiles. Higher luminance corresponds to lower $z$ values of the weak classifier in Real AdaBoost, that is, to feature weights that are effective for discrimination.

When humans are the detection target, HOG features are selected in all of the cell areas in Fig. \ref{fig: visual}(a), but the edges that follow the shape of a human have particularly high weights. Next, consider Fig. \ref{fig: visual}(b). Features that are not part of the human outline tend not to be selected, even though they were selected in Fig. \ref{fig: visual}(a). This probably reflects the judgment, made during feature selection in the second-round Real AdaBoost, that those features are ineffective for discrimination. Finally, consider Fig. \ref{fig: visual}(c). Joint HOG features whose cells follow the outline of a human tend to be selected in the second round. This demonstrates that our method is effective in detecting the human form, which is non-rigid.

When automobiles are the detection target, we see in Fig. \ref{fig: visual}(d) that many horizontal edges inside the automobile and edges that follow its outline are selected by the joint HOG features of the first-stage Real AdaBoost. In Fig. \ref{fig: visual}(e), we see that from the HOG features selected in Fig. \ref{fig: visual}(d), the joint HOG features that follow the outline of the automobile are selected by the final classifier obtained from the second-stage Real AdaBoost. HOG features that follow the automobile outline are thus effective for distinguishing the automobile from the background. Finally, consider Fig. \ref{fig: visual}(f). In the first and second rounds of training, positional relations of vertical and horizontal edges are selected. In the third round, cells with left-right symmetry are selected. In round 15, cells whose positional relation captures continuity are selected, together with horizontally oriented features. The proposed joint HOG features thus make it possible, through training and without advance preparation of features that capture automobile shape symmetry and continuity, to automatically select cells whose positional relations represent symmetry and continuity, and so to obtain a feature set that is effective for object discrimination.

4. Experiments on human form detection with joint HOG and temporal features

Low-level features other than the HOG features used so far can also be employed. In this section, we describe a human form detection method that combines HOG features with temporal features based on pixel state analysis (PSA), which were confirmed to be effective in \cite{AandS}.

4.1 Use together with temporal features

We use the temporal features based on pixel state analysis (PSA) \cite{PSA} that were shown to be effective in \cite{AandS}. Pixel state analysis classifies each pixel into one of three states, background, stationary, or transient, by modeling the temporal changes of the pixel. To discriminate the three states, a motion trigger (a sharp change in luminance between the preceding and following frames) and a stability measure are used (Fig. \ref{fig: ex_pixelstate}). As in \cite{AandS}, the pixel state analysis results are formed into histograms over cell areas, and feature weights are calculated from them to serve as temporal features. The pixel state histograms use the same area structure as the HOG features. Because pixels are classified into three states, three feature weights are calculated from a single histogram; for an input image of $30\times 60$ pixels, the PSA feature weights amount to $6 \times 12$ cells $\times$ 3 features $= 216$ features. Together with the HOG features, 12 features are thus obtained from each cell. The training process using joint features that combine HOG features and temporal features is shown in Fig. \ref{fig: joint}.
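Using the same cell structure as the HOG features, the temporal features can be sketched as follows; the state encoding (0 = background, 1 = stationary, 2 = transient) is an assumed convention.

\begin{verbatim}
import numpy as np

def psa_cell_features(states, cell=5):
    # states: per-pixel result of pixel state analysis, values in
    # {0: background, 1: stationary, 2: transient}.
    cy, cx = states.shape[0] // cell, states.shape[1] // cell
    feats = np.zeros((cy, cx, 3))
    for y in range(cy * cell):
        for x in range(cx * cell):
            feats[y // cell, x // cell, states[y, x]] += 1
    # Concatenating with the 9 HOG orientation bins of the same cell
    # yields the 12 features per cell described above.
    return feats
\end{verbatim}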
4.2 Overview of the experiments

We performed the evaluation experiments using the human database described in section \ref{DATA}, extracting the pixel state analysis results as feature weights. The pixel state analysis results for the positive samples, part of the training database, are shown in Fig. \ref{fig: database2}. In the experiments, we compared the proposed method (joint HOG-PSA + Real AdaBoost) with the conventional methods: co-occurrence of HOG features and temporal features + Real AdaBoost (HOG+PSA (co-occurrence)) \cite{AandS}, joint HOG features + Real AdaBoost, and HOG features + AdaBoost.

4.3 Experimental results

The detection results are shown as DET curves in Fig. \ref{fig: det2}. The proposed method exhibits higher detection performance than the conventional method using HOG feature and temporal feature co-occurrence. At a false positive rate of 5.0\%, the improvement in detection rate was 2.2\%, for a detection rate of approximately 99\%.

4.4 Discussion

Here we discuss the proposed method in terms of the features selected in training. The average gradient image of the positive samples used for training is shown in Fig. \ref{fig: AverageImage}(a). Panels (b), (c), and (d) of the same figure visualize the frequency of occurrence of the three states, computed from the pixel state analysis results of the positive training samples. In Fig. \ref{fig: AverageImage}, higher luminance indicates a stronger gradient or a higher frequency. In the average image for the background state in Fig. \ref{fig: AverageImage}(b), a human silhouette is visible. This indicates that, even though pixel state analysis cannot explicitly distinguish background from foreground, the two can be separated by treating background versus everything that is not background as a two-class problem. Furthermore, Fig. \ref{fig: AverageImage}(c) and (d) show that stationary-state pixels occur frequently in the upper half of the human shape, while transient-state pixels occur frequently around the feet.

A visualization of the features actually selected by Real AdaBoost is shown in Fig. \ref{fig: FeatureSelect2}. Features obtained from the background by pixel state analysis are often used in the early training rounds, which indicates that humans can be distinguished by whether background-state pixels are numerous or few. We can also see a tendency for temporal features to be selected early in training and for HOG features to be selected in the latter half. With Real AdaBoost, the features selected in the initial training rounds tend to be the more effective ones. To examine this effect in more detail, we show the proportion of HOG features and temporal features selected in each training round of the first-stage Real AdaBoost (Fig. \ref{fig: ratio}). At the beginning of training, temporal features are selected in very high proportion; from round 5 onward, HOG features are selected more often. That is probably because, at discrimination time, the temporal features, which can represent object movement, first roughly separate human from non-human, and the HOG features, which capture appearance information, are then selected to form detailed classification boundaries.

5. Object detection experiments

We conducted object detection experiments using the classifiers constructed above. To detect the target objects, a detection window is raster-scanned over the image from the upper left multiple times at different scales, which allows targets to be detected even if they appear at different sizes in the image. Windows in which the target is detected are finally integrated by Mean Shift clustering \cite{MeanShift}. The detection results are presented in Fig. \ref{fig: detector}(a) for humans and in Fig. \ref{fig: detector}(b) for automobiles. The detection of automobiles was highly accurate, with no missed detections and no false positives. The detection of humans uses temporal features of movement, so highly accurate detection is possible even when objects similar in shape to humans are present or the background is complex.
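The scanning procedure can be sketched as follows. The window size, scale steps, and stride are illustrative, and scipy's zoom stands in for whatever image resizing is actually used; the accepted windows would then be merged by Mean Shift clustering on their centers.

\begin{verbatim}
import numpy as np
from scipy.ndimage import zoom

def raster_scan_detect(image, classify, win=(60, 30),
                       scales=(1.0, 1.25, 1.5625), stride=5):
    # classify: the trained final classifier G applied to one window.
    h, w = win
    hits = []
    for s in scales:
        # Shrinking the image makes the fixed window cover a larger
        # region of the original image.
        img = zoom(image, 1.0 / s)
        for y in range(0, img.shape[0] - h + 1, stride):
            for x in range(0, img.shape[1] - w + 1, stride):
                if classify(img[y:y + h, x:x + w]):
                    # Map the hit back to original image coordinates.
                    hits.append((int(x * s), int(y * s),
                                 int(w * s), int(h * s)))
    return hits  # to be integrated by Mean Shift clustering
\end{verbatim}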
5.1 Overview of the counting experiments

We used the results of the human detection experiments to count the number of people in a scene, using a sequence of 500 video frames taken at three locations. The actual numbers of people in the images were counted by eye. People partially outside the image were excluded from the detection task, as were cases in which human forms overlap in the image and more than half of a body is hidden. The evaluation metric is the detection performance, computed as the percentage of the people actually present who are correctly detected.

5.2 Experimental results

We then counted the number of people from the human detection results. The counting results are presented in Table \ref{tbl: count}. Counting people with the proposed method produced a detection performance of 93.7\%. The usual accuracy of counting people by hand for marketing purposes is said to be 90\%. Because the proposed method can count people automatically with high accuracy, we expect it to find applications in marketing and other fields.

6. Conclusion

In this paper, we proposed an object detection method that uses joint HOG features, which combine multiple HOG features, together with two-stage Real AdaBoost training. Joint HOG features are effective for discrimination because they capture the relations between cells as features, namely the symmetry of object shape and edge continuity. In future work, we plan to address human detection in images from vehicle-mounted cameras using scene context information.