1. Introduction
Object detection, the task of searching for particular objects in an image, is one of
the most important problems in the field of computer vision, and many methods for it
have been proposed. Most object detection methods of recent years discriminate objects
by applying statistical learning to the feature weights of local areas extracted from
several thousand training samples. Low-level features such as Haar-like features
\cite{Haar-like}, Edge Orientation Histograms (EOH) \cite{EOH}, and Edgelets
\cite{Egdelet} have been used as the local features in such methods. Of these, the
Histogram of Oriented Gradients (HOG) \cite{HOG} has been demonstrated to be robust to
changes in illumination and to geometrical transformations. With these low-level
features alone, however, the recognition capability is limited.
Therefore, methods have been proposed in recent years that achieve highly accurate
object detection by generating mid-level features: statistical learning algorithms such
as AdaBoost \cite{AdaBoost} combine low-level features on the basis of relations between
feature weights (relatedness). Joint Haar-like features \cite{jointHaar-like} are
represented as combinations of binary symbols computed from Haar-like features, and
detection is performed on the basis of the probabilities of their co-occurrence. This
method can be faster and more accurate than conventional face detection, but the
feature weights depend on luminance values, so the method is not well suited to humans
and other objects that have diverse shapes and textures.
We therefore propose an object detection method that uses joint HOG features as
mid-level features that can automatically capture the symmetry and continuity of object
shape. The joint HOG features are obtained by combining the HOG features of two
different areas by means of the first-stage Real AdaBoost \cite{Real AdaBoost}. Doing so
automatically generates joint HOG features that represent symmetry and continuity,
which are difficult to represent with a single feature. The generated joint HOG
features are then input to the second-stage Real AdaBoost to construct the final
classifier and detect the object. In this paper, we demonstrate the effectiveness of
the proposed method experimentally, taking as the objects of detection the non-rigid
human body and automobiles, whose appearance differs greatly depending on the point of
view.
2. Joint HOG features
The flow of the proposed method is illustrated in Fig. \ref{fig1: flow}. The method
uses two rounds of Real AdaBoost, the second of which constructs the final classifier.
First, the first-round Real AdaBoost builds a pool of joint HOG features, each of which
combines two low-level HOG features obtained at different locations. This allows
multiple HOG features to be observed at the same time, which makes it possible to
automatically generate joint HOG features that represent symmetry and edge continuity,
qualities that cannot be captured by a conventional single HOG feature alone. Next, the
second-round Real AdaBoost selects from the joint HOG feature pool the features that
are most effective for human detection, and the resulting final classifier performs the
detection.
2.1 Low-level features: HOG features
In the work reported here, we use the Histogram of Oriented Gradients (HOG) proposed by
Dalal et al. as the low-level feature \cite{HOG}. HOG features are histograms of the
gradient orientations in local areas called cells (Fig. \ref{fig2: region}(b)). They
capture the shape of an object, are robust to changes in illumination, and are also
robust to local changes in geometry. The procedure for calculating the HOG features is
described below.
From the luminance $L$ of each pixel, the gradient magnitude $m$ and orientation
$\theta$ are computed with the following formulas.
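\begin{equation}
m(x, y) = \sqrt{f_{x}(x, y)^{2} + f_{y}(x, y)^{2}}, \qquad
\theta(x, y) = \tan^{-1}\frac{f_{y}(x, y)}{f_{x}(x, y)}
\end{equation}
\begin{equation}
f_{x}(x, y) = L(x+1, y) - L(x-1, y), \qquad
f_{y}(x, y) = L(x, y+1) - L(x, y-1)
\end{equation}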
The luminance gradient orientation histogram of each cell is generated from the
calculated gradient magnitude $m$ and orientation $\theta$: the obtained gradient
orientations are quantized into 20-degree bins to create the gradient orientation
histogram.
Finally, the feature weights are normalized over each block area (Fig. \ref{fig2:
region}(c)) with the following equation.
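\begin{equation}
v \leftarrow \frac{v}{\sqrt{\left(\sum_{i=1}^{k} v_{i}^{2}\right) + \epsilon}}
\end{equation}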
Here, $v$ is the HOG feature, $k$ is the number of HOG features in the block, and
$\epsilon$ is a coefficient that prevents division by zero.
2.2 HOG feature co-occurrence
To generate the joint HOG features, we represent the co-occurrence of multiple HOG
features \cite{jointHaar-like}. Representing co-occurrence in this way makes it
possible to observe two features at the same time.
First, we calculate binary symbols $s$ that separate detection objects from
non-detection objects with the following equation, in which a single feature weight $v$
chosen from a cell is compared against a threshold.
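\begin{equation}
s = \begin{cases} p & (v \ge \theta) \\ 1 - p & (\mathrm{otherwise}) \end{cases}
\end{equation}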
Here, $\theta$ is the threshold value, and $p$ is a code that determines the direction
of the inequality sign, taking values $p \in \{0, 1\}$. ${\bf V}=[v_{1}, v_{2},
\cdots, v_{o}]$ is the vector of HOG feature weights calculated from one cell, and $o$
is the number of gradient orientations. By combining the two binary symbols obtained in
this way, we get a feature $j$ that represents co-occurrence \cite{jointHaar-like}. For
example, when the HOG feature binary symbols $s_{1}=1$ and $s_{2}=1$ are observed in an
input image such as that shown in Fig. \ref{fig: Co}, the co-occurrence feature is
$j=(11)_{2}=3$. This $j$ is an index number for the binary representation of the
combined features. In this case there are four possible values, because we are dealing
with combinations of two features.
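As a minimal sketch of this computation (the function names and inputs are
illustrative, not the implementation used in the experiments), the index $j$ for a pair
of cells can be obtained as follows:

```python
# Minimal sketch: binary symbols and the co-occurrence index j.
# Thresholds theta and parities p are assumed to come from training.
def binarize(v, theta, p):
    """Binary symbol s; p in {0, 1} selects the inequality direction."""
    return p if v >= theta else 1 - p

def cooccurrence_index(v1, v2, theta1, theta2, p1, p2):
    """Combine two binary symbols into j = (s1 s2)_2 in {0, 1, 2, 3}."""
    s1 = binarize(v1, theta1, p1)
    s2 = binarize(v2, theta2, p2)
    return (s1 << 1) | s2

# Example: both symbols are 1, so j = (11)_2 = 3.
print(cooccurrence_index(0.8, 0.6, theta1=0.5, theta2=0.4, p1=1, p2=1))
```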
2.3 Mid-level features: joint HOG features
The HOG feature co-occurrence values calculated in section
\ref{Co-occurrenceFeatures} are used to generate joint HOG features in the first-round
Real AdaBoost (Fig. \ref{fig: jointHOG}). This captures the relations between cells, as
well as the symmetry of object shape and edge continuity.
First, from the features that represent co-occurrence for cells at two different
locations, $c_{m}$ and $c_{n}$, Real AdaBoost selects those that are effective for
discrimination. The function that observes HOG feature co-occurrence in input image $x$
is expressed as $J_{t}(x)$. When the feature $J_{t}(x)=j$ is observed in input image
$x$, the weak classifier $h_{t}(x)$ of the first-round Real AdaBoost is expressed as
follows.
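\begin{equation}
h_{t}(x) = \frac{1}{2} \ln \frac{P_{t}(y=+1 \mid j) + \epsilon}{P_{t}(y=-1 \mid j) + \epsilon}
\end{equation}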
Here, $t$ is the index of the training round and $\epsilon$ is a coefficient that
prevents division by zero; we determined by experiment that $\epsilon=0.0000001$. The
term $y$ is the correct label, $y \in \{+1,-1\}$; $P_{t}(y=+1 \mid j)$ and
$P_{t}(y=-1 \mid j)$ are the conditional probability distributions for when the
features $j$ that represent HOG feature co-occurrence are observed. The conditional
probability distributions are calculated from the weights $D_{t}(i)$ of the training
samples $i$ with the following equation.
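\begin{equation}
P_{t}(y=+1 \mid j) = \sum_{i:\, y_{i}=+1 \,\wedge\, J_{t}(x_{i})=j} D_{t}(i), \qquad
P_{t}(y=-1 \mid j) = \sum_{i:\, y_{i}=-1 \,\wedge\, J_{t}(x_{i})=j} D_{t}(i)
\end{equation}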
The conditional probability distributions $P_{t}(y=+1 \mid j)$ and
$P_{t}(y=-1 \mid j)$ are represented by one-dimensional histograms. The distributions
are created by calculating the features that represent co-occurrence from the training
samples $i$ and adding the training-sample weights $D_{t}(i)$ to the corresponding
one-dimensional histogram bin numbers $j$. Because the bin numbers $j$ correspond to
the index numbers of the features that represent co-occurrence, there are four bins in
the first-round Real AdaBoost.
Next, we use the conditional probability distributions to obtain an evaluation value
$z$ that represents the separation of the distributions, using the following equation.
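\begin{equation}
z = 2 \sum_{j} \sqrt{P_{t}(y=+1 \mid j)\, P_{t}(y=-1 \mid j)}
\end{equation}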
Smaller values of $z$ indicate greater separation of the positive-class and
negative-class distributions. This $z$ is used to select a weak classifier from among
the many co-occurrence features $j$ produced by the candidate thresholds: the feature
with the smallest $z$ is selected as the weak classifier.
Finally, the joint HOG feature $H^{c_{m}, c_{n}}(x)$, which is the strong classifier of
the first-round Real AdaBoost, is constructed with the following equation.
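\begin{equation}
H^{c_{m}, c_{n}}(x) = \mathrm{sign}\left( \sum_{t=1}^{T} h_{t}(x) \right)
\end{equation}
Here, $T$ is the number of training rounds in the first stage.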
The processing described above is applied to all combinations of cells, generating as
many joint HOG features as there are cell combinations. All of the generated joint HOG
features are put into a single pool that serves as input to the second-round Real
AdaBoost, which constructs the final classifier as described below.
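The first-round training can be summarized in code. Below is a minimal sketch under the
assumption that the co-occurrence indices $j$ have been precomputed for every candidate
cell pair; the array names and shapes are illustrative, not the authors'
implementation:

```python
import numpy as np

EPS = 1e-7  # the epsilon from the text, preventing division by zero

def train_first_round(j_values, labels, n_rounds=10, n_bins=4):
    """Minimal first-round Real AdaBoost over co-occurrence indices.

    j_values: (n_candidates, n_samples) integer array of co-occurrence
              indices j in {0, ..., n_bins - 1}, one row per candidate
              (cell pair, threshold) combination.
    labels:   (n_samples,) array with entries +1 / -1.
    Returns a list of (candidate_index, response_table) pairs; the
    response table holds h_t(x) for each bin j.
    """
    n_cand, n_samp = j_values.shape
    D = np.full(n_samp, 1.0 / n_samp)            # sample weights D_t(i)
    pos, neg = labels == +1, labels == -1
    classifiers = []
    for t in range(n_rounds):
        best = None
        for c in range(n_cand):
            # conditional probability histograms over the bins j
            Wp = np.bincount(j_values[c, pos], weights=D[pos], minlength=n_bins)
            Wm = np.bincount(j_values[c, neg], weights=D[neg], minlength=n_bins)
            z = 2.0 * np.sum(np.sqrt(Wp * Wm))   # separation measure
            if best is None or z < best[0]:
                h = 0.5 * np.log((Wp + EPS) / (Wm + EPS))
                best = (z, c, h)
        _, c, h = best
        classifiers.append((c, h))
        D *= np.exp(-labels * h[j_values[c]])    # reweight the samples
        D /= D.sum()                             # renormalize
    return classifiers
```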
2.4 Constructing the final classifier with the second-round Real AdaBoost
The second-stage Real AdaBoost takes the joint HOG feature pool generated in the first
stage as input and constructs the final classifier. In this way it is possible to
automatically select the joint HOG features that are effective for discrimination. The
following equation is used in the second-stage Real AdaBoost to obtain the final strong
classifier $G(c)$.
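With the second-round weak classifiers $h_{t}(x)$ built from the selected joint HOG
features, one form consistent with this description is the standard thresholded strong
classifier
\begin{equation}
G(x) = \mathrm{sign}\left( \sum_{t=1}^{T} h_{t}(x) - \lambda \right).
\end{equation}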
Here, $\lambda$ is the detector threshold value and $c$ is a combination of cells. The
second Real AdaBoost round constructs the detector by selecting from the joint HOG
feature pool only those features that are effective for discrimination.
3. Discrimination experiments using joint HOG features
To demonstrate the effectiveness of this method, we conducted experiments to
evaluate the detection of humans and automobiles.
3.1 Databases
A part of the database is shown in Fig. \ref{fig: database}. The human image database
is the one used in \cite{AandS}; its images are photographs taken at multiple
locations. In the same way as in \cite{AandS}, the database used for training contains
2,054 positive samples and 6,258 negative samples, and the evaluation database contains
1,000 positive samples and 1,235 negative samples.
The positive samples in the automobile database are automobile areas cut out
from video images taken by a vehicle-mounted camera pointed rearward. The negative
samples were random areas from the background. The training database comprised
2,464 positive samples and 2,415 negative samples; the evaluation database comprised
1,899 positive samples and 2,413 negative samples.
3.2 Overview of the experiment
We used the evaluation database to conduct the human and automobile
discrimination experiments. The parameters for the experiments are listed in Table 1.
The gradient orientations shown in Table 1 span 0 to 360 degrees for automobiles and 0
to 180 degrees for humans. The narrower range is used for humans so that clothing does
not affect the orientations, because the contrast between clothing and background is
sometimes inverted.
The method used for comparison is HOG features + AdaBoost (HOG) \cite{HOG}. In the
evaluation, we used Detection Error Tradeoff (DET) curves, which plot the false
positive rate on the horizontal axis against the miss rate (failure to detect) on the
vertical axis, both on logarithmic scales. In a DET plot, curves closer to the origin
indicate better performance.
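As an illustration of the DET format only (the numbers below are made up, not the
experimental results), such a curve can be plotted as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative DET curve: log-log axes, false positive rate on the
# horizontal axis, miss rate (failure to detect) on the vertical axis.
fp_rate = np.array([0.001, 0.005, 0.02, 0.05, 0.1])
miss_rate = np.array([0.40, 0.20, 0.08, 0.04, 0.02])

plt.loglog(fp_rate, miss_rate, marker="o")
plt.xlabel("false positive rate")
plt.ylabel("miss rate")
plt.title("DET curve (closer to the origin is better)")
plt.grid(True, which="both")
plt.show()
```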
3.3 Experimental results
The discrimination results are presented in Fig. \ref{fig: DET}(a) for humans and in
Fig. \ref{fig: DET}(b) for automobiles. We see that the proposed method discriminates
more accurately than the conventional method using HOG features. At a false positive
rate of 5.0\% for humans, the improvement in discrimination performance was about
24.6\%; at a false positive rate of 2.0\% for automobiles, it was about 7.0\%. These
improvements stem from the automatic selection of new, effective features created by
combining the HOG features of cells at two different locations, which allows
discrimination of patterns that are difficult to discriminate with single HOG features
alone.
3.4 Discussion
Here we discuss the joint HOG features selected by Real AdaBoost in the experiments on
discriminating humans and automobiles. The visualized results of the selected HOG
features are shown in Fig. \ref{fig: visual}(a) and Fig. \ref{fig: visual}(d) for the
first-round Real AdaBoost and in Fig. \ref{fig: visual}(b) and Fig. \ref{fig:
visual}(e) for the second-round Real AdaBoost. In (c) and (f) of the same figure, the
two selected cells and the joint HOG features from the second-round Real AdaBoost are
shown for each round. The HOG feature gradient orientations are represented by 9
directions for humans and 18 directions for automobiles. Higher luminance corresponds
to lower $z$ values of the weak classifier in Real AdaBoost, indicating feature weights
that are effective for discrimination.
When humans are the target of detection, HOG features are selected over all of the cell
areas in Fig. \ref{fig: visual}(a), but we can see that the edges that follow the shape
of a human have particularly high weights. Next, consider Fig. \ref{fig: visual}(b).
Features that are not part of the human outline tend not to be selected, even though
they were selected in Fig. \ref{fig: visual}(a). This probably results from the
second-round Real AdaBoost judging those features to be ineffective for discrimination
during feature selection. Finally, consider Fig. \ref{fig: visual}(c). Joint HOG
features of cells that follow the outline of a human tend to be selected in the
second-round Real AdaBoost. This demonstrates that our method is effective in detecting
the human form, which is non-rigid.
When automobiles are the target of detection, we see in Fig. \ref{fig: visual}(d) that
many horizontal edges inside the automobile, as well as edges that follow its outline,
are selected by the joint HOG features of the first-stage Real AdaBoost. In Fig.
\ref{fig: visual}(e), we see that from the HOG features selected in Fig. \ref{fig:
visual}(d), the joint HOG features that follow the outline of the automobile are
selected by the final classifier obtained from the second-stage Real AdaBoost. We thus
see that the HOG features that follow the automobile outline are effective for
distinguishing the automobile from the background. Finally, consider Fig. \ref{fig:
visual}(f). In the first and second rounds of training, positional relations of
vertical and horizontal edges are selected. In the third round, cells that have
left-right symmetry are selected. In round 15, cells whose positional relations capture
continuity are selected, along with horizontally oriented features. The proposed joint
HOG features thus make it possible, through training and without advance preparation of
features that capture the symmetry and continuity of automobile shape, to automatically
select cells whose positional relationships represent symmetry and continuity, and so
to obtain a feature set that is effective for object discrimination.
4. Experiments on human form detection with joint HOG and temporal features
Low-level features other than the HOG features used here can also be employed. In this
section, we describe a human detection method that combines HOG features with temporal
features based on pixel state analysis (PSA), which were confirmed to be effective by
\cite{AandS}.
4.1 Use together with temporal features
We use pixel state analysis (PSA) \cite{PSA} to obtain temporal features that have been
shown to be effective by \cite{AandS}. Pixel state analysis distinguishes pixels
according to three states, background, stationary, and transient, by modeling the
temporal changes of each pixel (Fig. \ref{fig: pixelstate}). To discriminate the three
states, a motion trigger (a sharp change in luminance between the preceding and
following frames) and a stability measure are used (Fig. \ref{fig: ex_pixelstate}).
In the same way as in \cite{AandS}, the results of the pixel state analysis are put
into histogram form for each cell area, and feature weights are calculated from them to
serve as temporal features. The pixel state histograms are created using the same area
structure as used for the HOG features. Because pixels are discriminated into three
states, three feature weights can be calculated from a single histogram. For an input
image of $30 \times 60$ pixels, the PSA feature weights obtained from the temporal
features amount to $6 \times 12 \text{ cells} \times 3 \text{ features} = 216$
features. By adding the HOG features to the temporal features, 12 features can be
obtained from each cell. The training process using joint features that combine HOG
features and temporal features is shown in Fig. \ref{fig: joint}.
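As a minimal sketch of the per-cell feature layout (the array sizes assume the
9-orientation human setting; the names are illustrative):

```python
import numpy as np

def cell_feature(hog_hist, psa_hist):
    """Concatenate one cell's HOG histogram (9 orientation bins) with
    its pixel-state histogram (background, stationary, transient),
    giving the 12 features per cell described above."""
    assert hog_hist.shape == (9,) and psa_hist.shape == (3,)
    return np.concatenate([hog_hist, psa_hist])

hog = np.random.rand(9)           # illustrative HOG cell histogram
psa = np.array([0.7, 0.2, 0.1])   # illustrative state frequencies
print(cell_feature(hog, psa).shape)  # -> (12,)
```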
4.2 Overview of the experiments
We performed the evaluation experiments using the human database described in section
\ref{DATA}. The results of the pixel state analysis were extracted as feature weights.
The pixel state analysis results for the positive samples, part of the database used
for training, are shown in Fig. \ref{fig: database2}. In the experiments, we compared
the proposed method (joint HOG-PSA + Real AdaBoost) with the conventional methods:
co-occurrence of HOG features and temporal features + Real AdaBoost
(HOG+PSA (co-occurrence)) \cite{AandS}, joint HOG features + Real AdaBoost, and HOG
features + AdaBoost.
4.3 Experimental results
The detection results are shown as DET curves in Fig. \ref{fig: det2}. The proposed
method exhibits higher detection performance than the conventional method using HOG
feature and temporal feature co-occurrence. At a false positive rate of 5.0\%, the
improvement in detection rate was 2.2\%, for a detection rate of approximately 99\%.
4.4 Discussion
Here we discuss the proposed method in terms of the features selected in training.
The average gradient image of the positive samples used for training is shown in Fig.
\ref{fig: AverageImage}(a). In (b), (c), and (d) of the same figure, the frequency of
occurrence of each of the three states is visualized from the pixel state analysis
results of the positive training samples. In Fig. \ref{fig: AverageImage}, higher
luminance indicates a stronger gradient or a higher frequency. In the average image for
the background state in Fig. \ref{fig: AverageImage}(b), a human silhouette is clearly
visible. This indicates that, even though pixel state analysis cannot explicitly
distinguish background from foreground, the two can be separated by treating background
versus everything that is not background as a two-class problem. Furthermore, from Fig.
\ref{fig: AverageImage}(c) and (d) we can see that stationary-state pixels occur with
high frequency in the upper half of the human shape, while transient-state pixels occur
with high frequency at the feet.
A visualization of the features actually selected by Real AdaBoost is shown in Fig.
\ref{fig: FeatureSelect2}. Features obtained from the background by pixel state
analysis are often used at the beginning of training. This indicates that humans can be
distinguished according to whether there are many or few background-state pixels.
We can also see a tendency for temporal features to be selected at the beginning of
training and for HOG features to be selected in the latter half of training. With Real
AdaBoost, features selected in the initial training rounds tend to be the more
effective ones. To examine this effect in more detail, we show the proportion of
selected HOG features and temporal features for each training round in the first stage
of Real AdaBoost (Fig. \ref{fig: ratio}). At the beginning of training, temporal
features are selected in very high proportion, and from round 5 onward the HOG features
are selected more often. This is probably because, at discrimination time, temporal
features that represent object movement first roughly distinguish human from non-human,
and then HOG features, which capture appearance information, are selected to form
detailed classification boundaries.
5. Object detection experiments
We conducted object detection experiments using the classifiers constructed above. To
detect the target objects, the detection window was raster scanned over the image from
the upper left multiple times at different scales. This allows detection of targets
even if they appear at different scales in the image. Windows in which the target was
detected are finally integrated by Mean Shift clustering \cite{MeanShift}. The
detection results are presented in Fig. \ref{fig: detector}(a) for humans and in Fig.
\ref{fig: detector}(b) for automobiles. The detection of automobiles was highly
accurate, with no failures to detect and no false positives. In the detection of
humans, temporal features of movement are used, so highly accurate detection is
possible even when there are objects similar in shape to humans or when the background
is complex.
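A minimal sketch of the scanning stage, assuming the trained classifier is available as
a callable (the names and parameters are illustrative; overlapping windows would then
be merged by Mean Shift):

```python
import numpy as np

def detect(image, classify, base_win=(64, 128), scales=(1.0, 1.5, 2.0), stride=8):
    """Raster-scan a detection window over the image at several scales;
    classify(patch) -> score, and a window is kept when score > 0."""
    H, W = image.shape[:2]
    hits = []
    for s in scales:
        ww, wh = int(base_win[0] * s), int(base_win[1] * s)  # window at this scale
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                patch = image[y:y + wh, x:x + ww]
                if classify(patch) > 0:
                    hits.append((x, y, ww, wh))
    return hits  # overlapping hits are merged afterwards (Mean Shift)
```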
5.1 Overview of the counting experiments
We used the results of the human detection experiments to count the number of people in
a scene, using a sequence of 500 video frames taken at three locations. The actual
numbers of people in the images were counted by eye. People who were partially outside
the image were excluded from the detection task, as were cases in which human forms
overlap in the image and more than half of a body is hidden. The evaluation metric is
detection performance, computed from the following equation.
5.2 Experimental results
Next, we counted the number of persons from the human detection results. The counting
results are presented in Table \ref{tbl: count}.
Counting people with the proposed method produced a detection performance of 93.7\%.
The accuracy typically expected of manual people-counting for marketing purposes is
said to be 90\%. Because the proposed method can automatically count people with high
accuracy, we expect it to find applications in marketing and other fields.
6. Conclusion
In this paper, we proposed an object detection method that uses joint HOG features,
which combine multiple HOG features, together with two-stage Real AdaBoost training.
The joint HOG features are effective for discrimination because they capture the
relations between cells as well as the symmetry and edge continuity of object shape. In
future work, we plan to address human detection in images from vehicle-mounted cameras
using scene context information.