National Research University Higher School of Economics
as a manuscript
Sokolova Anna
Neural network model for human recognition by gait in
different types of video
PhD Dissertation
for the purpose of obtaining academic degree
Doctor of Philosophy in Computer Science
Academic Supervisor:
PhD, Associate Professor
Anton S. Konushin
Moscow — 2020
Contents

Introduction
1. Background
   1.1 Influential factors
   1.2 Gait recognition datasets
   1.3 Literature review
       1.3.1 Manual feature construction
       1.3.2 Automatic feature training
       1.3.3 Event-based methods
2. Baseline model
   2.1 Motivation
   2.2 Proposed algorithm
       2.2.1 Data preprocessing
       2.2.2 Neural network backbone
       2.2.3 Feature postprocessing and classification
   2.3 Experimental evaluation
       2.3.1 Evaluation metrics
       2.3.2 Experiments and results
   2.4 Conclusions
3. Pose-based gait recognition
   3.1 Motivation
   3.2 Proposed algorithm
       3.2.1 Body part selection and data preprocessing
       3.2.2 Neural network pipeline
       3.2.3 Feature aggregation and classification
   3.3 Experimental evaluation
       3.3.1 Performance evaluation
       3.3.2 Experiments and results
   3.4 Conclusions
4. View resistant gait recognition
   4.1 Motivation
   4.2 View Loss
   4.3 Cross-view triplet probability embedding
   4.4 Experimental evaluation
   4.5 Conclusions
5. Event-based gait recognition
   5.1 Dynamic Vision Sensors
   5.2 Data visualization
   5.3 Recognition algorithm
       5.3.1 Human figure detection
       5.3.2 Pose estimation
       5.3.3 Optical flow estimation
   5.4 Event-based data
   5.5 Experimental evaluation
   5.6 Conclusions
Conclusion
References
List of Figures
List of Tables
Introduction
According to Maslow's studies [1], the need for safety is one of the basic and fundamental human needs. People tend to protect themselves, secure their homes against illegal intrusion, and protect their property from theft.
With the development of modern video surveillance systems, it has become possible to capture everything that happens in a certain area and then analyze the obtained data. Using video recordings, one can track the movements of people, detect illegal entry into private territory, identify criminals captured by the cameras, and control access to restricted objects. For example, video surveillance systems help to catch burglars, robbers, or arsonists, automatically count the number of people in a line or in a crowd, and analyze the character of their movements, reducing the amount of subjective human intervention and decreasing the time required for data processing. Besides this, being embedded in the now widely used home assistance systems ("smart home"), they can distinguish family members and change their behavior depending on the person; for example, such a system can be configured to perform different actions if it captures a child or an elderly person.
Dissertation topic and its relevance
Recently, the problem of recognizing a person in a video (see Fig. 0.1) has become particularly pressing. A person can be identified in a video by several cues, the most accurate of which is facial appearance. However, the current recognition quality allows decision-making to be entrusted to a machine only in a cooperative mode, when a person's face is compared with a high-quality photograph such as the one in a passport. In real life (especially when crimes are committed), a person's face may be hidden or poorly visible due to a bad viewing angle, insufficient lighting, or the presence of headgear, a mask, makeup, etc. In this case, another characteristic is required for recognition, and gait is a possible one. According to biometric and physiological studies [2; 3], the manner in which a person walks is individual and cannot be falsified, which makes gait a unique identifier comparable to fingerprints or the iris. In addition, unlike these "classic" identification modalities, gait can be observed at a great distance, it does not require high video resolution and, most importantly, no direct cooperation of the subject is needed; thus, a person may not even know that they are being captured and analyzed. Therefore, in some cases gait serves as the only possible cue for identifying a person in video surveillance data.
Figure 0.1 — Informal problem statement: given a video of a walking person, one needs to determine this person's identity from the database
The gait recognition problem is very specific due to the presence of many factors that change the gait visually (different shoes; carried objects; clothes that hide parts of the body or constrain movements) or affect the internal representation of the gait model (viewing angle, various camera settings). For this reason, the quality and reliability of identification by gait is much lower than by face, and, despite the success of modern computer vision methods, this problem has not been solved yet. Many methods are applicable solely to the conditions present in the databases on which they are trained, which limits their usability in real life.
In addition to classic surveillance cameras that record everything happening in the frame 25–30 times per second, other types of sensors, in particular dynamic vision sensors (DVS) [4—6], have been gaining popularity in recent years. Unlike conventional video cameras, such a sensor, like the retina, captures changes of intensity in each pixel, ignoring points with constant brightness. When the sensor is static, events at background points are generated very rarely, which prevents the storage of redundant information. At the same time, the intensity at each point is measured several thousand times per second, which leads to the asynchronous capture of all important changes. As a result, such a stream of events turns out to be very informative and suitable for obtaining the data necessary for solving many video analysis tasks that require the extraction of dynamic characteristics, including gait recognition.
Dynamic vision sensors are now a promising, rapidly developing technology, which creates the need to solve video analysis tasks for the data they produce. Despite the constant development of computer vision methods, no approaches to solving the gait recognition problem from dynamic vision sensor data have yet been proposed, and this represents a vast field for research.
Deep learning methods based on training neural networks have become the most successful approach to computer vision problems in recent years. Features learned by neural networks often have a higher level of abstraction, which is necessary for high-quality recognition. This makes it possible to achieve outstanding results in problems such as video and image classification, image segmentation, object detection, visual tracking, etc. However, despite the success of deep learning, classical computer vision methods are still ahead of neural networks on some gait recognition benchmarks, and neither approach has yet achieved accuracy acceptable for full deployment.
Goals and objectives of the research
This research aims to develop and implement a neural network algorithm for human identification in video based on the motion of the points of the human figure that is robust to viewing angle changes and to different clothing and carrying conditions. To achieve this goal, the following tasks are set:
1. Develop and implement an algorithm for human identification in video that analyses the optical flow.
2. Develop and implement a multi-view gait recognition algorithm based on the analysis of the motion of different human body parts.
3. Develop an algorithm for human recognition in event-based data from Dynamic Vision Sensors.
Formal problem statement
The formal research objects are video surveillance frame sequences $v$ and event streams $e$ from dynamic vision sensors in which a moving person is captured. Given a labelled gallery $D$, one needs to determine which person from the gallery appears in the video, i.e. the identity of the person in the video is to be defined. Let the gallery be defined as
$$D = \{(v_i, x_i)\}_{i=1}^{N}, \quad x_i \in P,$$
where $N$ is the number of sequences and $P$ is the set of subjects. The label of the subject $x \in P$ is to be found for the video $v$ under investigation. The goal is to construct some similarity measure $S$ according to which the closest object will be searched for in the gallery:
$$S(v, v_i) \to \min_{v_i:\ \exists x_i:\ (v_i, x_i) \in D}.$$
The problem for the event streams is stated similarly.
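To make the gallery search concrete, the following minimal sketch (in Python/NumPy, with hypothetical variable names) identifies a probe by its nearest gallery descriptor; the measure $S$ is assumed here to be the Euclidean distance, which the problem statement itself does not prescribe.

```python
import numpy as np

def identify(query_desc, gallery_descs, gallery_labels):
    """Return the identity label of the gallery entry closest to the probe.

    query_desc     -- 1-D gait descriptor of the probe video
    gallery_descs  -- (N, d) array of descriptors of the gallery videos v_i
    gallery_labels -- length-N sequence of subject labels x_i
    """
    # S(v, v_i) is taken to be the Euclidean distance (an assumption);
    # the closest gallery object determines the predicted identity.
    dists = np.linalg.norm(gallery_descs - query_desc[None, :], axis=1)
    return gallery_labels[int(np.argmin(dists))]
```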
A set of restrictions is imposed on all the sequences:
– each video in the gallery contains exactly one person;
– each person is captured full length;
– there are no occlusions;
– the camera is static;
– the set of possible camera settings (its height and tilt) is limited.
The described conditions are introduced due to the limitations of the existing datasets and benchmarks.
Novelty and summary of the author's main results
In this thesis, the author introduces original methods for human recognition by gait that are robust to view changes, to shortening of the video sequences, and to dataset transfer. The following is the list of the main research results; the list of the corresponding publications can be found in the Publications section below.
1. A side-view gait recognition method which analyses the point translations between consecutive video frames is proposed and implemented. The method shows state-of-the-art quality on the side-view gait recognition benchmark.
2. A multi-view gait recognition method based on the consideration of the movements of points in different areas of the human body is proposed and implemented. State-of-the-art recognition accuracy is achieved for certain viewing angles, and the best approaches at the time of the investigation are outperformed in verification mode.
3. The influence of point movements in different body parts on the recognition quality is revealed.
4. A gait recognition method robust to dataset transfer is proposed and implemented.
5. Two approaches for improving view resistance are proposed and implemented. Both methods increase the cross-view recognition quality and complement each other when applied simultaneously. The model obtained using these approaches outperforms the state-of-the-art methods on a multi-view gait recognition benchmark.
6. A method for human recognition by motion in event-based data from a dynamic vision sensor is proposed and implemented. Quality close to that of conventional video recognition is obtained.
The described results are original and obtained for the first time. Below, the author's contributions are summarized in four main points.
1. The first gait recognition method based on the investigation of point movements in different parts of the body is proposed and implemented.
2. Two original methods for improving view resistance are proposed. In these approaches, an auxiliary model regularization is applied and the descriptors are projected into a special feature space that decreases the view dependency.
3. An original study of gait recognition algorithm transfer between different data collections is carried out.
4. The first method for human recognition by motion in event-based data from a dynamic vision sensor is proposed and implemented.
Practical significance
Gait recognition is an applied computer vision problem. Motivated by natural and mathematical considerations, all the suggested methods and approaches aim to be applicable in practice. Thus, once implemented, the proposed human identification methods can be integrated into different automation systems. For example, the developed approach can be used in home assistance systems ("smart home") which recognize family members and change their behaviour depending on the captured person. Combined with an alarm, such a system can respond to the appearance of people not belonging to the family and detect illegal entry into private houses.
Besides this, the gait identification algorithm can be used in crowded places, such as train stations and airports, where it is not possible to take close-up shots but there is an obvious need for tracking and access control.
Publications and approbation of the research
The main results of this thesis are published in the following papers. The PhD candidate is the main author of all of these articles.
First-tier publications:
– A. Sokolova, A. Konushin, Pose-based deep gait recognition // IET
Biometrics, 2019, (Scopus, Q2).
Second-tier publications:
– A. Sokolova, A. Konushin, Gait recognition based on convolutional neural
networks // International Archives of the Photogrammetry, Remote Sensing
& Spatial Information Sciences, 2017 (Scopus).
– A. Sokolova, A. Konushin, Methods of gait recognition in video //
Programming and Computer Software, 2019 (WoS, Scopus, Q3).
– A. Sokolova, A. Konushin, View Resistant Gait Recognition // ACM
International Conference Proceeding Series, 2019 (Scopus).
The results of this thesis have been reported at the following conferences
and workshops:
– ISPRS International workshop “Photogrammetric and computer vision
techniques for video surveillance, biometrics and biomedicine” – PSBB17,
Moscow, Russia, May 15 – 17, 2017. Talk: Gait recognition based on
convolutional neural networks.
– Computer Vision and Deep Learning summit “Machines Can See”, Moscow,
Russia, June 9, 2017. Poster: Gait recognition based on convolutional neural
networks.
– 28th International Conference on Computer Graphics and Vision
“GraphiCon 2018”, Tomsk, Russia, September 24 – 27, 2018. Talk: Review
of video gait recognition methods.
– Samsung Biometric Workshop, Moscow, Russia, April 11, 2019. Talk:
Human identification by gait in RGB and event-based data.
– 16th International Conference on Machine Vision Applications (MVA),
Tokyo, Japan, May 27 – 31, 2019. Poster: Human identification by gait
from event-based camera.
– 3rd International Conference on Video and Image Processing (ICVIP),
Shanghai, China, December 20 – 23, 2019. Talk: View Resistant Gait
Recognition (best presentation award).
Thesis outline. The thesis consists of the introduction, the background chapter, four chapters corresponding to the developed approaches, and the conclusion.
1. Background
This chapter describes the current state of affairs in the gait recognition field. Prior to the description of existing methods, the datasets available for gait recognition research are listed. There exists a large number of different gait collections; however, they hardly cover all the conditions that can influence the gait representation. The large variability of affecting factors is the main problem of the gait recognition challenge; thus, we first describe these factors, then list the databases covering the variability of some of them, and finish the chapter with an overview of existing approaches.
1.1 Influential factors
As in any computer vision problem, in gait recognition there are a lot of factors that do not change the content of the video but change its appearance and, therefore, influence the inner representation of the gait. All these factors can be divided into two groups: those affecting the gait itself and those affecting only its representation. The factors from the first group really change the manner of walking, making the recognition more complex for both humans and computers. The second group contains conditions that change only the inner representation while the gait remains the same. It can be obvious to a human that the same person appears in two videos, while for a computer they are two completely different video sequences with different features.
Let us start with the first group. These factors include external conditions, such as different footwear (which can be tight, uncomfortable, or have high heels) changing the gait, or carrying conditions (bags, backpacks, or any heavy load) that occupy the hands and make the walk heavier. Besides this, there are internal conditions changing the gait, for example, illness (injury, lameness, fatigue, etc.) or drunkenness. One more internal condition is a long time interval between recordings: even if nothing has happened since the first recording, the gait can change a bit if a long time passes before the next one. These factors are illustrated in Fig. 1.1.
Figure 1.1 — Examples of factors affecting the gait itself: using a supporting cane (a), wearing high heels (b), and carrying conditions: a bag (c), a backpack (d) or other loads (e)
The second group of influential factors, illustrated in Fig. 1.2, does not change the walking manner but changes its representation. These factors are more cunning, since they can be imperceptible to the human eye but powerful for a machine. They include view variation, which is the most challenging problem for gait recognition, different clothing that can hide some parts of the body and change the shape of the human figure
(for example, long dresses, coats, or puffer jackets), various occlusions, and changes of lighting and camera settings. A human may not even notice the difference, but it is a big challenge to train a model robust to all these variations. All the described factors can be combined, making the recognition problem even more complex (see Fig. 1.1 (b) with a long dress and heels, Fig. 1.2 (d) with dresses and occlusions or (e) with a coat and carrying conditions, and all the images in Figs. 1.1 and 1.2 having different views).
Figure 1.2 — Examples of factors affecting the gait representation: wearing an overall (a), a long dress (b) or a coat (c), occlusions (d), and bad lighting and low quality of the frame (e)
Due to such great variability of influential conditions, large general datasets are hard to collect. Different researchers pay attention to different factors while ignoring others, which leads to overfitting of the models to the exact conditions presented in the database. The next section describes the datasets available at the time of the study and the conditions presented in them. One can see that even the largest databases do not cover all the conditions, leaving a large field for further research.
1.2 Gait recognition datasets
At the time of the study, the most widely used complex datasets for gait recognition are "TUM Gait from Audio, Image and Depth" (TUM-GAID) [7], the OU-ISIR Large Population Dataset (OULP) [8], and CASIA Gait Dataset B [9]. These databases are not the only existing collections for gait recognition, but they are the most popular ones due to their size and condition variability; thus, they are the datasets used in this study. In this section, these collections are described along with other gait recognition databases.
The first and simplest database, TUM-GAID, is used for side-view recognition (see Fig. 1.3). It contains videos of 305 subjects walking from left to right and from right to left, captured from the lateral view. The frame rate of the videos is about 30 fps and the resolution of each frame is 320 × 240. TUM-GAID contains videos with different walking conditions: 6 normal walks, 2 walks carrying a backpack, and 2 walks wearing coating shoes for each person. Although there are recordings from only one viewpoint, the number of subjects allows taking around a half of the people as the training set and the other half for evaluation. The authors of the database have provided a split into training, validation, and testing sets, so that the models can be trained using the data for 150 subjects and tested on the other 155. Hence, no randomness influences the results. Being full-color, this collection is applicable in a large number of experiments with different approaches. In addition, this database contains videos taken with a six-month interval, which makes it possible to check the robustness of the algorithms to temporal gait changes.
Figure 1.3 — Examples of frames from the TUM-GAID database: normal walks, coating shoes, and a backpack
The second dataset, CASIA B, is relatively small in terms of the number of subjects, but it has a large variation of views: 124 subjects are captured from 11 different viewing angles (from 0° to 180° with a step of 18°). The videos last 3 to 5 seconds, the frame rate is about 25 fps, and the resolution equals 320 × 240 pixels. All the trajectories are straight and recorded indoors in the same room for all people. Besides the variety of viewpoints, similarly to the first collection, different clothing and carrying conditions are presented: wearing a coat and carrying a bag. In total, 10 video sequences are available for each person captured from each view: 6 normal walks without extra conditions, two walks in outdoor clothing, and two walks with a bag. The CASIA dataset is very popular due to its view variability. Nevertheless, there are too few subjects to train a deep neural network to recognize gait from all views without overfitting. Examples of video frames from the CASIA B collection are shown in Fig. 1.4.
Figure 1.4 — Examples of frames from CASIA Gait Dataset B: normal walks, outerwear, and a bag
The third one, the OULP set, consists of video sequences of more than 4,000 people taken with two cameras, with the shooting angle varying smoothly from 55° to 85°. The data from this collection is distributed in the form of silhouette masks, therefore not all the described methods are applicable to this database. The silhouettes are binarized and normalized to the size 88 × 128, as shown in Fig. 1.5. Nevertheless, since many of the developed gait recognition methods are evaluated on this database, it is used for comparison of the approaches as well.
Figure 1.5 — Examples of frames from OU-ISIR Large Population Dataset
However, there exist many other gait databases collected for experiments in different conditions that have to be mentioned for completeness as well. At the Technical University of Munich, not only the TUM-GAID dataset was collected but also another one, TUM IIT-KGP [10], in collaboration with the Indian Institute of Technology Kharagpur. This database is rather small, containing 35 subjects, but the walks are captured in 6 different conditions including dynamic and static occlusions (with objects that move or do not move, respectively). Thus, TUM IIT-KGP addresses the occlusion challenge, which complicates recognition but frequently occurs in real life.
One more factor varying in some of the databases is walking speed. For example, CASIA Gait Database C [11] and the CMU Motion of Body dataset (CMU MoBo) [12] contain normal, slow, and fast walks, and the OU-ISIR Gait Speed Transition Dataset (OUST) [13] consists of videos of walks where the speed gradually changes: it accelerates from 1 km/h to 5 km/h and decelerates from 5 km/h to 1 km/h. Another database collected at Osaka University, the OU-ISIR Treadmill (OUTM) dataset A [14], contains walks on a treadmill with the speed varying from 1 km/h to 10 km/h. Similarly to OULP [8], all the OU-ISIR collections are distributed in the form of binary masks, which restricts their usage for experiments.
Osaka University researchers have also collected an age-varying dataset, the OU-ISIR Gait Database, Large Population Dataset with Age (OULP + Age) [15], where the age of the subjects ranges from 2 to 90 years. Investigations show that the walking manner can change depending on the person's age; thus, it is worth verifying whether this correlation really exists. To examine whether the gait can change over a long period of time, time-varying datasets are collected. Similarly to TUM-GAID, where 32 subjects are captured with a six-month time interval, the MIT AI collection [16] and the USF HumanID dataset [17] contain different types of variations including the time interval between recordings. The latter contains 1870 sequences obtained from 122 subjects with 32 condition variations; the database is so large that it is distributed only on a physical USB drive, which makes obtaining it a long and complicated process.
It is worth noting one more variation included in some of the datasets: terrain variation. Although the background is usually static in gait recognition collections, the environment can influence the representation; thus, different terrains can help prevent overfitting to the dataset. The datasets containing such variations are the USF HumanID dataset and the SOTON Large database [18].
The collections described above are not the only available datasets; many other databases [19—23] address the problems of multi-view recognition, clothing variation, and carrying conditions. Most of them are quite small, which prevents their usage for neural network training; others (such as the OU-ISIR Multi-View Large Population Dataset (OU MVLP) [24]) were collected recently, after the proposed approach was developed, so they also have not been used in the experiments. Nevertheless, these databases influence the development of gait recognition methods, so they are worth mentioning. Table 1.1 shows some details of the video datasets available for gait recognition.
1.3 Literature review
As for most computer vision problems, there are currently two approaches to obtaining gait features from the source data: hand-crafted construction of the descriptors and their automatic training.
The first approach is more traditional and is usually based on the calculation of various properties of binary silhouette masks or on the study of joint positions, their relative distances and speeds, and other kinetic parameters. Feature training is usually performed by artificial neural networks, which have become very popular in recent years due to outstanding results in many computer vision problems, such as video and image classification, image segmentation, object detection, visual tracking, and others. Features trained by neural networks often have a higher level of abstraction, which is necessary for high-quality recognition. The best identification quality is achieved by methods combining these two approaches: initially, the basic characteristics of the gait are computed manually, and then they are fed to a neural network to extract more abstract features. However, despite the success of deep methods, there are non-deep methods outperforming neural networks on some of the benchmarks; thus, both global approaches are worthy of attention.
1.3.1 Manual feature construction
Let us first discuss some classical basic approaches in which the gait descriptors are constructed manually based on natural considerations.
Table 1.1 — Gait recognition datasets (TM — treadmill)

| Database | Num. of subjects | Num. of sequences | Environmental settings | Variations |
|---|---|---|---|---|
| TUM-GAID [7] | 305 | 3370 | Indoor | Directions(2), load(2), time(2), footwear(2), clothing(2) |
| TUM-IITKGP [10] | 35 | 840 | Indoor | Directions(2), load(2), occlusion |
| CASIA A [19] | 20 | 240 | Outdoor | Views(3) |
| CASIA B [9] | 124 | 13640 | Indoor | Views(11), clothing(2), load(2) |
| CASIA C [11] | 153 | 1530 | Outdoor, thermal camera | Speed(3), load(2) |
| OUTM A [14] | 34 | 612 | TM, Indoor | Speed(9) |
| OUTM B [14] | 68 | 2176 | TM, Indoor | Clothing(32) |
| OUTM C [14] | 200 | 5000 | TM, Indoor | Views(25) |
| OUTM D [14] | 185 | 370 | TM, Indoor | Gait fluctuations |
| OULP [8] | 4007 | 7842 | Indoor | Views(4) |
| OUST [13] | 179 | n/a | Indoor, TM/static floor | Speed shift |
| OULP + Age [15] | 63,846 | 63,846 | Indoor | Age (from 2 to 90) |
| OU MVLP [24] | 10307 | n/a | Indoor | Views(14) |
| AVAMVG [23] | 20 | 1200 | Indoor | Directions(2), views(2) |
| CMU MoBo [12] | 25 | 600 | TM, Indoor | View(6), speed(2), inclination(2), load(2) |
| USF [17] | 122 | 1870 | Outdoor | View(2), terrain(2), footwear(2), load(2), time(2) |
| MIT AI [16] | 24 | 194 | Indoor | Views(4), time |
| SOTON Large [18] | 114 | 2128 | Indoor/outdoor, TM/static floor | View(2), terrain(3) |
| SOTON Temporal [22] | 25 | 26160 | Indoor | View(12), clothing(2), time(5), footwear(2) |
Binary silhouettes
The most common gait characteristic is the gait energy image (GEI) [25], the average over one gait cycle of the binary silhouette masks of a moving person. Such a spatio-temporal description of human gait is computed under the assumption that the movements of the body are repeated periodically during walking. The resulting images characterize the frequency with which a moving person appears in a particular position. This approach is widely used, and many other gait recognition methods are based on it. Besides this, many approaches that do not use gait energy images directly suggest a similar aggregation of other basic features. For example, recognition can be performed with gait entropy images (GEnI) [26], where the entropy of each pixel is calculated instead of silhouette averaging, or with the frame difference energy image (FDEI) [27] reflecting the differences between silhouettes in consecutive frames of the video. Visualizations of a binary silhouette mask and the gait energy and entropy images are shown in Fig. 1.6.
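As an illustration, a minimal sketch of how the GEI and GEnI could be computed from aligned binary silhouette masks is given below; silhouette alignment and gait-cycle segmentation are assumed to be done beforehand, and the sketch does not reproduce the full pipelines of the cited methods.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI: per-pixel average of binary silhouette masks over one gait cycle."""
    masks = np.stack(silhouettes, axis=0).astype(np.float32)  # (T, H, W), values in {0, 1}
    return masks.mean(axis=0)

def gait_entropy_image(silhouettes, eps=1e-8):
    """GEnI: per-pixel Shannon entropy of the silhouette values over the cycle."""
    p = gait_energy_image(silhouettes)   # probability of the pixel being foreground
    return -(p * np.log2(p + eps) + (1.0 - p) * np.log2(1.0 - p + eps))
```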
The problem with all these methods averaging frame features is the loss of temporal information, since the order of the frames is not taken into account. Several approaches [28—30] aim at preventing this loss. Thus, Chrono-Gait Images (CGI) [28] encode the temporal information by extracting the contours from the silhouettes and weighting them according to the position of the corresponding frame in a gait cycle. Another gait descriptor, the Gait Flow Image (GFI) [29], is based on the binarized optical flow between consecutive silhouette images and reflects the step-by-step changes of the silhouettes. One more silhouette-based gait representation technique preserving temporal information is frequency-domain gait entropy (EnDFT) [30]. This approach combines GEnI with the discrete Fourier transform (DFT) [31], in which the mean of the binary silhouette masks with exponentially decreasing weights is calculated. EnDFT is a mixture of these approaches: the entropy is computed from the DFT representation instead of the GEI.
Despite the equal simplicity and naturalness of these methods, it is the GEI that is most widely used and developed so far. Similarly to ordinary black-and-white images, GEIs can be used for further feature calculation, such as histograms of oriented gradients (HOG descriptors) [32; 33] or histograms of optical flow (HOF descriptors) [34; 35], or for constructing more complex classification algorithms that use the specifics of the gait recognition problem.
Figure 1.6 — Examples of basic gait descriptors: a binary silhouette, GEI, and GEnI
One of the most successful multi-view approaches [36] also uses GEI as basic features. It is a Bayesian approach that considers the gait energy images as random matrices being the sum of the gait identity and some independent noise corresponding to different gait variations (such as viewpoint, different clothes, or the presence of carried things), and both terms are assumed to be normally distributed random variables. Considering the joint distribution of GEI pairs under the assumption that the corresponding classes coincide or differ reduces the problem to an optimization problem for the covariance matrices that can be solved using the EM algorithm. Another approach considering correlation [37] is an adaptation of canonical correlation analysis to the gait recognition problem.
Another successful classical approach that uses GEI as gait features is based on linear discriminant analysis (LDA) [38]. The idea of constructing an embedding such that the features become more clustered is proposed in several works [39—42]. In the MVDA approach [40], a multi-view discriminant analysis is carried out: for gait features computed for each viewing angle, a separate embedding is learned to minimize the intraclass variation and maximize the interclass one. A similar approach is proposed in ViDP (view-invariant discriminative projection) [42], aiming to decrease the difference between the features of the same identity regardless of the view and increase the gaps between different identities, and in [41], where a unified framework is proposed for learning the metrics of the joint intensity of a pair of images and the spatial metric. The sequential optimization of both metrics leads to a model that surpasses the basic discriminant analysis models in recognition quality.
Let us finish the description of silhouette-based approaches by mentioning two more methods [43; 44]. In these approaches, the natural assumption is made that same-view recognition is easier than cross-view recognition; thus, view transformation models are proposed to embed the descriptors corresponding to different views into a common subspace.
The described approaches are intuitively clear and mathematically simple, which, in addition to high recognition results, gives them an advantage over many other more complex methods. A common drawback of most of these multi-view recognition methods is the need to calculate the gait descriptor for each viewing angle present in the database. Therefore, for each frame of the video one needs to know the shooting angle, which is not always possible in real data.
Human pose
Another important source of information, complementing silhouettes, is the human skeleton. Many recognition methods are based on the investigation of the human pose: the positions of the joints and the main parts of the body and their motion in the video during walking. Approaches based on posture range from fully structural ones (considering the kinematic characteristics of the pose) to more complex ones, combining kinematic and spatio-temporal characteristics. The work [45], in which recognition is performed both by gait and by appearance, can be included in the first group. The main gait characteristics used in this approach are absolute and relative distances between the joints, as well as features based on the displacement of key points of the figure between frames. Various kinematic parameters, such as the trajectories of the hips, ankles, and knees, and relative angles between different joints, are also proposed [46] for gait recognition, in addition to the gait period and walking speed. The authors of one more approach [47] introduce a more complex mathematical model, considering a family of smooth Poisson distance functions for constructing a skeleton variance image (SVI).
Several other works also propose models based on the positions of human body parts but use them in conjunction with features of a different nature. For example, the method [48] combines a kinematic approach with a spatial one, considering both the trajectories of the joint movements and the change of the silhouette shape over time as dynamic gait features.
Body points trajectories
Another source of information that can be used for gait recognition is the set of point trajectories. Fisher motion descriptors are proposed [49] to be constructed from the trajectories and then classified by a support vector machine.
1.3.2 Automatic feature training
Despite the abundance of structural non-deep approaches, convolutional neural networks (CNNs) hold a strong position in all computer vision tasks, including gait recognition. Over the past few years, many neural network gait identification methods have been proposed, differing both technically (in the choice of network architectures, loss functions, and training methods) and ideologically, in the method of data processing and extraction of the primary features supplied to the network input. The similarity of the gait and action recognition problems means that many approaches applied to the latter can be used for the former. The first and most influential deep method was proposed by Simonyan and Zisserman [50]. To recognize human actions, they trained a convolutional network with an architecture comprising two similar types of branches: image and flow streams. The former processes raw video frames to get the spatial representation, while the latter receives maps of optical flow (OF) computed from pairs of consecutive frames as input to extract the temporal information. To consider long-duration actions, several consecutive flow maps are stacked into blocks that are used as network inputs. Many action recognition algorithms based on this framework apply various top architectures (including recurrent architectures [51] and methods for fusing several streams [52]). Another method based on optical flow considers temporal information using three-dimensional convolutions [53]. Actually, optical flow is quite similar to dense trajectories [49]; the difference is that for trajectories we fix the initial "starting" point, while the optical flow is considered at a new location each time. Thus, one more approach [54] proposes to obtain temporal information not from the optical flow but from the trajectories computed from the video. Combining these trajectories with spatial features extracted by a neural network gives enough information for gait recognition.
The second and most popular source of information used for neural network training is the binary silhouette masks, which have already been discussed when considering non-deep methods. In the simplest DeepGait method [55], the
convolutional architecture is trained to predict the individual from separate silhouettes. The network is then used to extract features, and the aggregation of individual frame descriptors over the entire video is performed by selecting the maximum response over the gait cycle. This method is the simplest of all deep approaches, since when the silhouette masks of people are available no additional preprocessing is required. The length of the gait cycle is determined by considering the autocorrelation of the sequence of binary images. DeepGait is further developed by the approach of [56], where the Joint Bayesian method [36] is applied to the neural features obtained from the silhouettes.
Another method using the silhouettes themselves is proposed in the view-invariant gait recognition network (VGR-Net) [57]. The two-step algorithm first determines the shooting angle of the video, and then it predicts the person based on the initial data and the model for the detected angle (a separate model for each angle). To consider not only the spatial but also the temporal characteristics of the motion, not separate masks but blocks of several consecutive silhouettes are fed to the input of the network. In addition, the variability between frames is taken into account by the network architectures in both subtasks: the authors use three-dimensional networks in which convolutions are performed not only in the spatial domain but also in the temporal one. In another work [58], where 3D convolutions are proposed to capture spatio-temporal features, a single network used for all views shows almost perfect quality in same-view recognition.
It is also worth highlighting a variety of methods utilizing the popular GEI mentioned above. Models based on it range from the simplest GEINet [59], where a small network predicts a person from the gait energy images calculated for different angles, to more complex ones [60], where the similarity of GEI images is determined and various methods of comparing neural network gait features obtained from the energy images are investigated. In other works [61; 62], also using Siamese architectures with two or three streams trained by contrastive or triplet loss optimization, it is determined which gait images are close and belong to the same person and which images belong to different people. The first approach able to outperform [60], which had been state-of-the-art for several years, is GaitSet [63], which considers sets of silhouettes that are invariant to permutation and allow mixing frames obtained under different views. The authors consider various aggregation functions and their combinations (Set Pooling) to get set-level features from frame-level ones and also present Horizontal Pyramid Mapping, which splits the neural feature maps into strips of different scales in order to aggregate multi-scale information.
Besides, most of the architectures and approaches popular in recent years have been successfully used for gait recognition. For example, autoencoders used for image generation and representation learning are also applicable to identification [64]. It is proposed to solve the problems of view variability and carried items using a set of autoencoders, each performing its own transformation similar to the view transformation models [43]. Thus, passing through a sequence of "coding" layers, the image is transformed to a side view, which is easier for recognition. Similar approaches are proposed using generative adversarial networks (GANs) [65; 66] to convert the GEI to a "basic" view (lateral viewing angle, no bag and no coat).
In addition to classical feed-forward convolutional networks, recurrent neural networks can be used for gait recognition as well as for other video analysis tasks. Recurrent architectures make it possible to compute informative dynamic gait features directly from the source data. For example, LSTM [67] layers are used to aggregate features obtained from raw frames and optical flow maps [51] or silhouette masks [68]. However, there are many approaches that apply recurrent layers not to hidden neural representations but to previously trained, exact, interpretable features. Thus, the authors of several methods [69—71] construct a pose estimation network whose outputs are then fed to an LSTM layer. Such pose-based approaches are more general, since they do not take the shape of the human figure into account. Battistone and Petrosino [72] go further and unite the recurrent neural network with a graph approach exploiting both structured data and temporal information.
Most newly appearing neural network approaches are being applied to gait recognition, which leads to active development of this area.
1.3.3 Event-based methods
Since the research of this thesis evolves towards dynamic vision, a review of event-based methods has to be provided. Dynamic vision sensors [4—6] have become very popular recently. Due to their memory and temporal efficiency, they are integrated into video surveillance systems; thus, most computer vision problems have to be solved for the data obtained by these sensors. Recently, many solutions have been proposed for different related problems; however, the gait recognition challenge had not been considered in terms of dynamic vision until the study presented in this work. Nevertheless, the approaches to some computer vision problems are worth mentioning.
For example, many different models have been proposed for the object detection and tracking problems [73—75], which usually play an auxiliary role for gait recognition. Another close problem is action recognition, which is also considered in terms of DVS data [76; 77]. In the latter work [77], three two-dimensional projections of the event stream are considered: the natural x-y projection consisting of the events aggregated over a time interval, together with the x-t and y-t projections, where events are accumulated over horizontal and vertical intervals. Features are extracted from these maps using Speeded Up Robust Features (SURF) [78] and then clustered and classified. One more related problem is gesture recognition [79; 80], which, like gait recognition, cannot be solved by appearance alone and requires motion information. The first approach [79] is based on Hidden Markov Models, while the second one [80] uses deep learning methods.
The last problem worth mentioning while discussing event-based vision is optical flow estimation. Since both OF and the event stream reflect motion, methods of computing OF from DVS data are developed especially actively. The approaches vary from precise mathematical methods [81], where differential equations relating the OF and the event stream are constructed and solved, to more "trainable" ones, such as the deep EV-FlowNet [82] approach based on an encoder-decoder neural architecture. Various methods for event data are constantly appearing, simplifying the processing of such data and spreading dynamic vision sensors across the video surveillance field.
2. Baseline model
2.1 Motivation
Being a problem of biometric identification, gait recognition has a lot in common with other classical problems such as face, iris, or fingerprint recognition. In terms of machine learning, all of these problems can be stated in a similar way: given a sequence of frames with a walking person (or an image of a face, a fingerprint, etc.), one needs to determine the ID of the presented person from a labelled database, i.e. classify the object. Mathematically, the only difference is the dimensionality of the input tensor due to the additional temporal dimension of the video.
One more similar computer vision problem is re-identification (re-id) [83]. This problem also generally concerns videos, but it is usually based on the comparison of separate frames and on determining whether the same person appears in different frames according to appearance rather than motion.
However, the presence of a continuous frame stream changes the nature of the problem. Considering the gait recognition problem, one can see that the temporal component contains much information which can be used for better identification. This makes the investigated problem close to action recognition and other video classification tasks and allows applying the methods developed for those problems (e.g. [84]) to gait recognition. Nevertheless, gait recognition is still not the same problem as action recognition. The set of possible actions is much wider than the set of different gaits. Despite the fact that each person has his or her own unique manner of walking, the variation of gaits is quite small. While two actions are usually easily distinguishable for the human eye, the gaits of two people are much closer and harder to differentiate. Most actions can be identified by "appearance" characteristics. For example, if a person holds a cellphone near the ear, he probably talks to someone on the phone; or if a person's figure is hanging in the air, he must have jumped. But the gait is always the same "walking" action, and if people have a similar build and clothing or the resolution is low, the motion becomes more important than people's appearance.
To separate the problem of gait recognition from re-identification and face recognition, one needs to analyse the movements of the points rather than the raw frames containing the appearance. Besides this, most characteristics of the image appearance, such as color or contrast, do not influence the gait properties (see Fig. 2.1), which confirms that the motion has to be extracted from the video.
Figure 2.1 — The same frame with different brightness, contrast and saturation
settings
For these reasons, the translations of the points are the data of greatest interest. The mathematical object containing the information about these translations is the optical flow (OF); thus, optical flow is chosen as the main source of information for gait recognition.
Formally, OF is the pattern of apparent motion of objects in a visual scene
caused by the relative motion between an observer and a scene [85; 86]. It reflects
the visible movement of the points of the scene between consecutive video frames
(Fig. 2.2). In mathematical terms, optical flow is a vector field corresponding to the translation of the points. Assuming that $I_1$ and $I_2$ are two consecutive frames of the video with width $W$ and height $H$, $OF$ is a 3-dimensional tensor of size $W \times H \times 2$ such that
$$I_2[x + OF[x, y, 0],\; y + OF[x, y, 1]] = I_1[x, y], \qquad (2.1)$$
i.e. the point with $I_1$-frame coordinates $(x, y)$ shifts to $(x + OF[x, y, 0],\; y + OF[x, y, 1])$ in $I_2$. If the camera is static, the background neither moves nor changes, so only the moving objects generate flow, and a zero vector field corresponds to the background.
Figure 2.2 — Optical flow as a vector field of motion between consecutive frames: the first frame, the second frame, and the estimated optical flow
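For intuition, relation (2.1) can be checked numerically: warping the second frame backwards along the flow should approximately reconstruct the first frame. The short sketch below follows OpenCV's array convention (maps indexed as [row, column], channel 0 holding the horizontal displacement) and is an illustration only, not part of the proposed pipeline.

```python
import cv2
import numpy as np

def warp_back(frame2, flow):
    """Reconstruct frame 1 from frame 2 and a dense flow field of shape (H, W, 2)."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)  # x + OF[..., 0]
    map_y = (grid_y + flow[..., 1]).astype(np.float32)  # y + OF[..., 1]
    # Sample frame 2 at the shifted positions; the result approximates frame 1.
    return cv2.remap(frame2, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```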
Optical flow usage was proposed for action recognition in [50], which was one of the most successful video classification approaches of its time and encouraged many researchers to analyse optical flow in addition to raw frames. Since gait recognition is much more motion-dependent than general action recognition, optical flow is supposed to contain all the information needed for motion analysis and for recognizing people by their walking manner. Besides, using the OF maps as the source data, one can be sure that the recognition is based not on a person's appearance, clothes, body type, or shape, but on motion patterns.
2.2 Proposed algorithm
Figure 2.3 — General algorithm pipeline
In this section, the pipeline of the proposed baseline algorithm is described and the details of all the steps are explained. Figure 2.3 shows the general scheme of the recognition process. The whole algorithm consists of descriptor calculation and classification, corresponding to the orange and green blocks in the scheme, respectively. The feature calculation step, in turn, includes preparation and preprocessing of the data, neural descriptor extraction, and feature postprocessing and aggregation to get one final gait descriptor. The following subsections focus on each of the four main steps of the algorithm:
1. Data preprocessing;
2. Neural network backbone;
3. Feature postprocessing;
4. Classification.
29
2.2.1 Data preprocessing
As explained in section 2.1, the optical flow maps are chosen as the source data for the algorithm. Unlike the two further steps of feature calculation, the preprocessing stage of the proposed algorithm is non-trainable: all the parameters and hyperparameters are chosen and fixed beforehand, and no optimization is performed. Thus, the preprocessing can be carried out once and the data can be stored in processed form in order not to repeat the same calculations at each training and testing iteration. For this reason, the first preparation step is the estimation of the optical flow between each pair of consecutive frames and saving it in a convenient format. The Farneback algorithm [87] was chosen for OF estimation due to its stability and the presence of open-source implementations in different computer vision libraries (e.g. the OpenCV library [88]). The algorithm consists of two steps: quadratic approximation of each pixel neighbourhood by a polynomial expansion transform and point displacement estimation based on the transformation of the polynomials under translations. The output of the estimator is the tensor $OF$ described in section 2.1. Inspired by [51], we store these tensors as images. To that end, we apply a linear transformation to the maps so that all the values of the OF lie in the interval [0, 255], similarly to common RGB images. Besides this, since the tensor has 2 channels (the horizontal and vertical components of the flow), we add an auxiliary zero map to get a three-channel tensor and to be able to save it in the same way as raw frames. The two separate components of the flow and the constructed three-channel image are presented in Figure 2.4.
Figure 2.4 — Visualizations of the horizontal and vertical OF components and the constructed 3-channel image
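A minimal sketch of this preprocessing step is shown below: the Farneback flow is computed with OpenCV and rescaled to [0, 255] so that it can be stored as an ordinary image. The Farneback parameters and the clipping bound used for the linear rescaling are illustrative assumptions, not values taken from the thesis.

```python
import cv2
import numpy as np

def flow_as_image(prev_gray, next_gray, bound=20.0):
    """Estimate Farneback optical flow and encode it as a 3-channel uint8 image."""
    # Arguments after None: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # shape (H, W, 2)
    # Linear transformation of the flow values into [0, 255]; the clip bound is an assumption.
    clipped = np.clip(flow, -bound, bound)
    scaled = ((clipped + bound) * (255.0 / (2.0 * bound))).astype(np.uint8)
    # Append an all-zero third channel so the tensor can be saved like an RGB frame.
    zero = np.zeros_like(scaled[..., :1])
    return np.concatenate([scaled, zero], axis=-1)                  # shape (H, W, 3)
```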
Although optical flow maps contain information about motion by themselves, in order to enhance the temporal component and to use not only instantaneous but continuous motion, the maps are used not separately but jointly. In more detail, several consecutive maps are stacked into one block, and such blocks are used as the network input. So, instead of two-channel maps, the network is fed with tensors with 2𝐿 channels, where 𝐿 is the number of frames in a block.
Since the frame usually contains not only the human figure but may also include other visible moving objects and background, a bounding box containing exactly the considered figure has to be cropped from the maps to get rid of noise and errors. To get this box, any human figure detection algorithm can be applied. However, in recent years a lot of research in the human body analysis field has been carried out, and human figure detection often turns out to be an auxiliary problem that is solved together with other, more complex challenges. For example, the human figure bounding box can be obtained during pose estimation. The OpenPose algorithm [89] estimates the locations of the main body joints of all the people present in a frame, and once these locations are found, one can easily calculate their bounding box, which is the bounding box of the whole figure. This method has been chosen for the baseline approach. In chapter 3 it will be shown how to use all the information obtained from the OpenPose algorithm for recognition improvement.
To prevent relative shifts of the maps, one common bounding box is calculated for the whole block. Specifically, let $\{I_t\}_{t=1}^{T}$ be the video sequence with a moving person and $\{B_t\}_{t=1}^{T}$ be the bounding boxes containing the figure in every frame. It is convenient to encode each box by the coordinates of its upper left and lower right vertices, $B_t = (x^l_t, y^u_t, x^r_t, y^l_t)$. Having the bounding box for each frame, the outer box that contains all the boxes of the block is constructed. For a block that consists of the OF maps for frames $I_k, I_{k+1}, \dots, I_n$, the outer box coordinates are
$$x^l = \min_{t=k,\dots,n} x^l_t, \quad y^u = \min_{t=k,\dots,n} y^u_t, \quad x^r = \max_{t=k,\dots,n} x^r_t, \quad y^l = \max_{t=k,\dots,n} y^l_t.$$
This box is then made square by widening or lengthening it (depending on the aspect ratio) while maintaining the center of the box.
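A sketch of the outer box construction could look as follows; clamping the square to the frame borders is an added assumption to keep the crop valid.

```python
def outer_square_box(boxes, frame_w, frame_h):
    """Union of per-frame boxes (x_l, y_u, x_r, y_l), made square around its center."""
    xl = min(b[0] for b in boxes)
    yu = min(b[1] for b in boxes)
    xr = max(b[2] for b in boxes)
    yl = max(b[3] for b in boxes)
    side = int(round(max(xr - xl, yl - yu)))          # widen or lengthen to a square
    cx, cy = (xl + xr) / 2.0, (yu + yl) / 2.0
    x0 = int(round(cx - side / 2.0))
    y0 = int(round(cy - side / 2.0))
    # Clamp to the frame so the square crop stays inside the image (assumption).
    x0 = max(0, min(x0, frame_w - side))
    y0 = max(0, min(y0, frame_h - side))
    return x0, y0, x0 + side, y0 + side
```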
The OF images are stored together with all the bounding boxes, and prior to inputting the data into the network, a certain number 𝐿 of consecutive horizontal and vertical components of the OF maps are concatenated into one block and the common square patch is cropped from the block. Various values of 𝐿 and arrangements of the blocks were considered, but the best results were obtained with 𝐿 = 10 and a 5-frame overlap between blocks, so the 𝑘-th block consists of the maps $\{OF_t\}_{t=5k-4}^{5k+5}$, $k \geqslant 1$. Figure 2.5 shows a cropped block obtained by this procedure frame by frame.
Figure 2.5 — Input block: the square box cropped from 10 consecutive frames of horizontal (the first and second rows) and vertical (the third and fourth rows) components of the normalized OF maps
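The block formation with 𝐿 = 10 and a 5-frame overlap could be implemented as in the sketch below; storing the horizontal and vertical components of each frame as adjacent channels is one possible layout, assumed here for illustration.

```python
import numpy as np

def make_blocks(flow_maps, L=10, step=5):
    """Stack consecutive OF maps (each H x W x 2) into overlapping 2L-channel blocks."""
    blocks = []
    for start in range(0, len(flow_maps) - L + 1, step):
        chunk = flow_maps[start:start + L]          # L maps; step = 5 gives a 5-frame overlap
        block = np.concatenate(chunk, axis=-1)      # H x W x 2L (h and v components per frame)
        blocks.append(block.transpose(2, 0, 1))     # channels-first for the network input
    return blocks
```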
2.2.2
Neural network backbone
The main part of the algorithm is based on neural networks. Being a powerful tool for many computer vision problems, the deep learning technique is used in the proposed algorithm for embedding training and further feature extraction. All the considered neural networks are trained for the classification task by the classical backpropagation method. However, to get one general model which does not require re-training when a new subject is added to the database, we do not use a neural network as a classifier itself, but cut off its head and compute the hidden representations of the objects. Such an approach makes the feature extraction process easier and reduces the model update to re-fitting or fine-tuning a simple classification model, which takes much less time and computation.
The following part of this section will describe the steps that are performed
in order to get neural network gait descriptors:
1. data augmentation;
2. neural network architecture selection;
3. training procedure;
4. feature extraction.
Data augmentation
As noted in section 1, one of the greatest problems preventing a general solution of the gait recognition challenge is the lack of data. Gait dataset collection is a complex process, and currently even the largest collections are relatively small. This factor makes the application of deep learning methods harder, since even not very deep architectures usually contain a large number of parameters to optimize and, hence, require a large amount of training data. Thus, data augmentation is needed to increase the number of different training objects.
During the training process, we first decrease the resolution of the frames to 88 × 66 to keep the network input size moderate, and then construct the blocks by randomly cropping a square containing the block bounding box of the figure. If both sides of the box (after the proportional box transformation corresponding to the resolution decrease) are less than 60 pixels, we uniformly choose the left and upper sides (x and y, respectively) of the square so that x is between the left border of the frame and the left side of the box and x + 60 is between the right side of the box and the right border of the frame. The upper bound y of the square is chosen similarly. If any of the box sides is greater than 60 pixels (which happens rarely, since 60 is almost the whole height of the frame), we crop the square with the side equal to the maximum of the height and width of the box and then resize it to 60 × 60. Thus, we get an input tensor of shape 20 × 60 × 60, where 20 is the number of channels corresponding to 10 pairs of horizontal and vertical components of optical flow, and 60 is the spatial size of each map. In addition, we flip all frames of the block simultaneously with probability 50% to get the mirror sequence. Since we use the optical flow instead of raw images, the reflection is made as follows:

horizontal component: OF[x, y, 0] := −OF[w − x, y, 0],
vertical component: OF[x, y, 1] := OF[w − x, y, 1],

where w is the width of the frame, equal to 66 after the resolution decrease.
These two transformations help us to increase the amount of training data and to reduce the overfitting of the model. Besides, before inputting the blocks into the net we subtract the mean values of the horizontal and vertical components of optical flow over the whole training dataset.
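A minimal sketch of these two augmentation steps is given below, assuming a block tensor of shape (2L, H, W) with horizontal components in the even channels and a broadcastable array of training-set means; the names and shape conventions are assumptions for illustration.

import numpy as np

def augment_block(block, of_mean, rng):
    # Mirror a (2L, H, W) block with probability 0.5 and subtract the dataset mean
    # of the flow components (of_mean must broadcast against the block).
    if rng.random() < 0.5:
        block = block[:, :, ::-1].copy()        # reflect all maps left-right
        block[0::2] *= -1.0                     # horizontal components change sign
    return block - of_mean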
Neural network architectures and training process
We consider several architectures of convolutional neural networks and
compare them.
CNN-M architecture
We started from the architecture used in [90], which is based on the CNN-M architecture from [91], and used a model based on this architecture as a baseline. The only modification we made is using a Batch Normalization (BatchNorm) layer [92] instead of the Local Response Normalization (LRN) [93] used initially. According to numerous investigations, batch normalization allows each layer to be trained a bit more independently, which is very useful when we extract the hidden representations from the network. Besides, this normalization prevents the unlimited growth of activations, allowing the learning rate to be increased. Finally, it works as a regularizer and reduces the overfitting of the model, which is especially beneficial given the relatively small amount of training data. Thus, BatchNorm showed great acceleration and improvement in trainability and was selected as the main normalization method for network training.
The architecture is described in Table 2.1.

Table 2.1 — The baseline architecture

                  C1          C2          C3          C4      F5        F6        SM
Kernel size       7 × 7       5 × 5       3 × 3       2 × 2   —         —         —
Output channels   96          192         512         4096    4096      2048      subjects number
Extras            BatchNorm   Stride 2    Stride 2    —       Dropout   Dropout   —
                  Pool 2      Pool 2      Pool 2
Activation        ReLU        ReLU        ReLU        ReLU    ReLU      ReLU      SoftMax

The first four columns correspond to convolutional layers. They are determined by the kernel size (the first row), the number of output channels (the second row), the presence of additional transformations following the convolution, such as normalization, stride or pooling (the third row), and the activation function (the last row). The first layer has 96 filters of size 7 × 7, followed by BatchNorm and a 2 × 2 MaxPooling layer. The second convolution consists of 192 filters of size 5 × 5 with stride 2 and 2 × 2 pooling. The third layer has 512 kernels of size 3 × 3 with the same stride and pooling parameters. The convolutional part of the network ends with the C4 layer, which consists of 4096 filters of size 2 × 2 without further dimensionality reduction. After this block, three fully-connected layers are applied. The fifth and sixth layers consist of 4096 and 2048 units, respectively, and both are followed by dropout [94] layers with parameter p = 0,5. The first six layers are followed by the rectified linear unit (ReLU) activation. The last one is the prediction layer with the number of units equal to the number of subjects in the training set and softmax non-linearity.
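For illustration, the following is a rough PyTorch sketch of this baseline architecture; the original model was implemented in another framework (Lasagne/Theano), and the padding values here are assumptions chosen only so that a 20 × 60 × 60 input passes through the network. The softmax is left to the loss function, as is standard in PyTorch.

import torch.nn as nn

class CNNM(nn.Module):
    # Sketch of the CNN-M-style baseline from Table 2.1 (layer sizes follow the table).
    def __init__(self, num_subjects, in_channels=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7),                    # C1
            nn.BatchNorm2d(96), nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.Conv2d(96, 192, kernel_size=5, stride=2, padding=2),       # C2
            nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.Conv2d(192, 512, kernel_size=3, stride=2, padding=1),      # C3
            nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.Conv2d(512, 4096, kernel_size=2), nn.ReLU(inplace=True),   # C4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.Dropout(0.5), nn.ReLU(inplace=True),    # F5
            nn.Linear(4096, 2048), nn.Dropout(0.5), nn.ReLU(inplace=True),  # F6
            nn.Linear(2048, num_subjects),                                  # SM (logits)
        )

    def forward(self, x):
        return self.classifier(self.features(x))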
The model with this architecture and the cross-entropy loss function was trained by the Nesterov Momentum gradient descent method with a starting learning rate of 0,1, reduced by a factor of 10 each time the error stopped decreasing.
Despite the moderate size of the whole net, it was trained in four steps. Starting with the fourth, fifth and sixth layers of sizes 512, 512 and 256, respectively, the sizes of these layers were doubled each time the loss function stopped decreasing.
Two ways of widening the layers were considered: adding randomly initialized new parameters, and the net2net method [95] developed for rapid transfer of information from one network to another. The latter approach does not lose any information, and the validation quality of the net before and after widening is the same. However, it turns out that such preservation of the information obtained in previous training steps increases overfitting and worsens the testing quality, whereas adding randomly initialized new weights introduces a random component that works as a regularizer and improves the classification accuracy. Glorot [96] weight initialization with the uniform distribution was chosen for sampling the new weights and confirmed the positive influence of adding noise inside the model during training. More precisely, let L_i be a dense layer that has to be widened, let n be its initial size and 2n the required size. Then three network parameters should be modified: the weight matrix W_{(i−1)→(i)} and the bias b_{(i−1)→(i)} between layers L_{i−1} and L_i, and the weight matrix W_{(i)→(i+1)} between L_i and L_{i+1}. Each of these parameters doubles one of its dimensions; thus, a random matrix (or vector) of the same shape is sampled from the Glorot uniform distribution and then concatenated with the pretrained matrix. Because these layers are fully connected, appending new dimensions with random values adds some noise to the layer outputs, but since these values are distributed around zero, the noise turns out to be small enough and regularizes the training.
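A NumPy sketch of this widening step for one dense layer is given below; the matrix shape convention (inputs × outputs) and the zero initialization of the new biases are assumptions made for illustration.

import numpy as np

def glorot_uniform(shape, rng):
    # Glorot/Xavier uniform initialization for a dense weight matrix.
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

def widen_dense_layer(W_in, b_in, W_out, rng=None):
    # Double the width of layer L_i by appending randomly initialized units.
    # W_in : (d_prev, n) weights from L_{i-1} to L_i; b_in : (n,) biases of L_i;
    # W_out: (n, d_next) weights from L_i to L_{i+1}.
    rng = rng or np.random.default_rng(0)
    d_prev, n = W_in.shape
    d_next = W_out.shape[1]
    W_in_new = np.concatenate([W_in, glorot_uniform((d_prev, n), rng)], axis=1)
    b_in_new = np.concatenate([b_in, np.zeros(n)])
    # the small random rows feeding the next layer act as noise-based regularization
    W_out_new = np.concatenate([W_out, glorot_uniform((n, d_next), rng)], axis=0)
    return W_in_new, b_in_new, W_out_new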
CNN-M fusion
The first modification of the CNN architecture is a slow fusion of several branches of the same network (similar to the idea from [84]). Instead of one big 10-frame block taken as input, 4 consecutive blocks of smaller size are used. In more detail, 4 blocks of OF maps are taken, each consisting of 5 pairs of maps, and passed through distinct branches of convolutional layers with shared weights. After the fourth convolutional layer, the outputs of the first two branches and of the last two are concatenated, so that two streams remain instead of four. These two outputs are passed through the fully-connected fifth layer, after which they are concatenated into one and passed through the remaining layers of the network. Figure 2.6 presents the scheme of the proposed architecture. This approach allows a longer period of walking to be taken into account without losing instantaneous features. This net was trained stage by stage as well.
VGG-like architecture
The third architecture considered in this research is similar to the architectures of the VGG family [97]. The idea of using blocks of convolutions with small kernel sizes has proven very successful in the pattern recognition area. Nevertheless, although each filter in the VGG architecture has only a 3 × 3 kernel, the depth of the network makes the number of parameters extremely large relative to the available dataset sizes. To prevent overfitting, a truncated VGG-16 architecture was taken; namely, the last convolutional block was removed from the network.
The details are described in Table 2.2.
Figure 2.6 — The scheme of CNN-M fusion architecture
Table 2.2 — The VGG-like architecture

B1           B2            B3            B4            F5        F6        SM
3 × 3, 64    3 × 3, 128    3 × 3, 256    3 × 3, 512    4096      4096      subjects number
3 × 3, 64    3 × 3, 128    3 × 3, 256    3 × 3, 512    Dropout   Dropout   SoftMax
                           3 × 3, 256    3 × 3, 512    ReLU      ReLU
Pool 2       Pool 2        Pool 2        Pool 2
ReLU         ReLU          ReLU          ReLU
The first four columns correspond to blocks of convolutional layers. All the convolutions have 3 × 3 filters; the first block consists of two layers with 64 filters each, the second one consists of two layers with 128 filters each, and the third and fourth blocks contain three layers with 256 and 512 filters, respectively. All the convolutional blocks are followed by MaxPooling layers. After the convolutional blocks, there are three dense layers. Two of them consist of 4096 units and are followed by dropout with parameter p = 0,5, and the last one is the softmax layer. All layers except the last one are followed by the ReLU non-linearity. This net was also trained step by step, starting with both fully connected layers of size 1024 and doubling them when the validation error stopped decreasing.
In all the architectures, L2 regularization of the F5 and F6 layers was added to the cross-entropy loss function in order to prevent excessively large weights in the dense layers, which lead to overfitting. The regularization parameter was increased step by step from 0,02 to 0,05.
Feature extraction
Once the net is trained, it is used as a feature extractor. The SM dense layer is removed from the network, and the outputs of the last hidden layer are used as gait representations. The steps described above allow a feature vector to be obtained for each block of optical flow maps. During the testing phase, these blocks are constructed deterministically: no flips are made, and the patches are cropped using the mean values of the admissible ranges of the left and upper bounds.
2.2.3
Feature postprocessing and classification
Having a feature vector for each OF block, one needs to get a final response
for this set of features, i.e. process the obtained descriptors, aggregate them into one
gait descriptor and classify it. This is what this section is focused on.
It is worth noting that in the most trivial case, when the subjects from the training set are the same as the subjects from the testing one, the network itself can be used as an already fitted classifier. It is equivalent to logistic regression of the probability distribution over subjects and, hence, to linear classification of the neural descriptors. Nevertheless, if the training and testing subjects differ, the classifier requires fitting, so the feature learning and the classification are proposed to be done separately.
The Nearest Neighbour (NN) classifier has been chosen as the gait feature classification method. Being one of the simplest methods in machine learning, it corresponds to the search for the most similar (according to some similarity measure) object in the database. Despite its simplicity, this approach shows high classification accuracy and often outperforms other popular classification methods such as logistic regression or the support vector machine (SVM).
Being a metric algorithm, nearest neighbour classification is much more successful if the features are preliminarily scaled; thus, L2 normalization of each block descriptor is made right after its extraction. However, the direct search for the closest descriptor for each block does not result in a high recognition quality. Since each block contains quite a short video subsequence, it does not capture all the motion, and the features learned from this data may depend on the location of the considered block within the gait cycle. Thus, the closest descriptor may correspond to the block with the same position belonging to another subject. To get rid of the position dependence, we first average the features over the whole video into one gait descriptor and then search for the most similar one among the other video descriptors. In addition to position independence, such aggregation accelerates the further classifier fitting and application, since the number of training and testing descriptors is greatly reduced: only one descriptor remains for each video regardless of its length. Compared to a voting scheme, where the results are aggregated after the classifier is applied, this approach shows a great gain in speed as well as in quality.
After the average gait descriptor is calculated for each video, the closest one from the database is found and returned as the response. In addition to the classical Euclidean distance between vectors, the Manhattan distance is considered as the base for the NN classifier. It turns out that this metric is more stable and appropriate for measuring the similarity of gait descriptors and shows a better result in the majority of experiments.
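The aggregation and nearest-neighbour search described above can be summarized by the following NumPy sketch (the function names and the in-memory gallery layout are illustrative assumptions).

import numpy as np

def video_descriptor(block_features):
    # L2-normalize each block descriptor, then average over the whole video.
    feats = np.array(block_features, dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12
    return feats.mean(axis=0)

def nearest_neighbour_id(query, gallery, labels, metric="l1"):
    # 1-NN classification of a video descriptor against the gallery descriptors.
    diff = gallery - query
    if metric == "l1":
        dists = np.abs(diff).sum(axis=1)            # Manhattan distance
    else:
        dists = np.sqrt((diff ** 2).sum(axis=1))    # Euclidean distance
    return labels[int(np.argmin(dists))]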
The described process allows fast and accurate baseline classification of the neural descriptors. However, there are two auxiliary modifications that can be made to the descriptors to improve the recognition. First of all, principal component analysis (PCA) can be applied to reduce the dimensionality of the feature vectors and get rid of some of the noise they contain. Besides, such a low-dimensional projection accelerates the fitting of any classifier. Being a linear transformation, PCA could be merged with the last dense layer of the network, but the non-linear descriptor normalization is made between them, and the experiments show that this solution increases the model accuracy.
Figure 2.7 — The pipeline of the algorithm based on blocks of optical flow maps
Further recognition improvement can be achieved with the Triplet Probability Embedding (TPE) [98]. This method aims to find a low-dimensional discriminative embedding (projection matrix) that makes the features of the same object closer to each other than the features of different objects. It is trained to find the embedding such that

S_W(v_a, v_p) > S_W(v_a, v_n),    (2.2)

where S_W is a similarity measure, usually defined as the cosine similarity, W is the parametrization of the embedding, the features v_a, v_p belong to the same object, and v_n belongs to a different one.
The inequality (2.2) is quite abstract, and the following optimization problem is formulated to achieve it. Let us consider the probability of the triplet (v_a, v_p, v_n):

p_apn = e^{S_W(v_a, v_p)} / (e^{S_W(v_a, v_p)} + e^{S_W(v_a, v_n)}).    (2.3)

The closer v_a and v_p are, relative to the similarity of v_a and v_n, the higher this probability p_apn is. Thus, in order to get the optimal parameters W, one can follow the maximum likelihood method and maximize this probability or its logarithm, which is usually easier to optimize:

Loss = Σ_{(v_a, v_p, v_n)} −log(p_apn) → min_W.    (2.4)
Hence, the problem of finding S_W reduces to finding W using any optimization method. For the gait descriptors, this approach has been modified by using the Euclidean distance instead of the scalar product. Defining the similarity as

S_W(u, v) = −||Wu − Wv||_2^2,

the likelihood maximization problem becomes

Σ_{(v_a, v_p, v_n)} log [ e^{−||Wv_a − Wv_p||^2} / (e^{−||Wv_a − Wv_p||^2} + e^{−||Wv_a − Wv_n||^2}) ] → max_W,    (2.5)

which can be solved by the stochastic gradient descent (SGD) method.
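For one triplet, the loss corresponding to (2.5) and its gradient with respect to W can be written down explicitly, as in the NumPy sketch below; triplet sampling, batching and the SGD update W := W − lr · grad are omitted, and the function name is illustrative.

import numpy as np

def tpe_euclidean_loss_and_grad(W, va, vp, vn):
    # Negative log triplet probability under S_W(u, v) = -||Wu - Wv||^2.
    da, dp, dn = W @ va, W @ vp, W @ vn
    sp = -np.sum((da - dp) ** 2)            # S_W(v_a, v_p)
    sn = -np.sum((da - dn) ** 2)            # S_W(v_a, v_n)
    loss = np.logaddexp(0.0, sn - sp)       # -log p_apn, numerically stable
    coeff = 1.0 / (1.0 + np.exp(sp - sn))   # = 1 - p_apn
    grad_sp = -2.0 * np.outer(da - dp, va - vp)   # d sp / d W
    grad_sn = -2.0 * np.outer(da - dn, va - vn)   # d sn / d W
    grad = coeff * (grad_sn - grad_sp)
    return loss, grad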
In addition to improved performance, this embedding retains the advantages of dimensionality reduction, such as memory and time efficiency. This embedding shows the best performance and completes the description of the proposed algorithm. The whole model pipeline is shown schematically in Figure 2.7.
2.3
Experimental evaluation
In this section, the details of the experiments and their numerical evaluation are described.
2.3.1
Evaluation metrics
With a nearest neighbour classifier on top, the proposed approach outputs the ID of the most probable subject the considered video belongs to. Thus, the simplest and most convenient metric to evaluate the quality of the algorithm is accuracy, or Rank-1, reflecting the ratio of correctly classified objects. However, to make a more thorough comparison of all the models, the Rank-k metric is considered for k = 5. This measure equals the ratio of samples for which the correct label belongs to the set of the k most probable answers. To evaluate Rank-k, one needs to predict not only the most probable class, but the probability distribution over all the classes or the k most probable classes. To find these k most probable classes, the classes are sorted by their distance to the test object, where the distance between a class and the test object is defined by the class representative closest to the test object. Thus, all gallery samples are traversed in order of increasing distance, skipping samples belonging to classes that have already appeared. The five "closest" classes are taken as the most probable ones and are used for the Rank-5 calculation.
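A sketch of this Rank-k computation with an L1-based nearest-neighbour search is shown below (setting k = 1 gives the usual accuracy); the function name and data layout are assumptions.

import numpy as np

def rank_k_hit(query, gallery, gallery_labels, true_label, k=5):
    # True if the correct subject is among the k closest distinct classes,
    # where the distance to a class is the distance to its closest representative.
    dists = np.abs(gallery - query).sum(axis=1)     # L1 distance to every gallery sample
    seen = []
    for idx in np.argsort(dists):
        label = gallery_labels[idx]
        if label not in seen:
            seen.append(label)
        if len(seen) == k:
            break
    return true_label in seen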
2.3.2
Experiments and results
All the methods described above were evaluated on two colour gait datasets: TUM-GAID [7] and CASIA Gait Dataset B [9]. For the first one, the classification train-test split provided by the authors of the dataset was used: 150 training subjects and 155 testing ones. With regard to the CASIA dataset, there is no single training protocol, and different researchers split the collection differently. Since the baseline model was proposed for side-view recognition, only the videos captured under the 90° angle were used in the experiments. Similarly to TUM, the dataset was split approximately in half: 60 subjects were used for training the network and the remaining 64 subjects formed the testing set.
While the training parts of the datasets were used for neural network and TPE training, the evaluation was made on the testing subsets. Having 10 videos for each person, we fit the nearest neighbour classifier on the first 4 "normal" walks and evaluate the model quality on the remaining videos.
All the experiments can be divided into three parts.
1. Distinct learning and evaluating on each of the datasets;
2. Learning on one dataset and evaluating on both of them;
3. Joint learning on two datasets and evaluating on both of them.
These experiments aim to estimate the influence of different architectures and
embeddings on the quality of classification and compare with existing methods.
Besides this, the generalizability of these methods is investigated. Ideally, the
algorithm should not depend on the background conditions, the illumination,
proximity to the camera or anything other than the walking style of a person. So, it
is worth evaluating how the algorithm trained on one dataset works on the other one.
The first set of experiments was conducted to evaluate the efficiency of the proposed approach in general. A network was trained from scratch with the top layer of size equal to the number of training subjects. Then the low-dimensional embedding was trained and the index for the search was constructed on the fitting data. The results of the experiments are shown in Table 2.3.
Table 2.3 — Results on TUM-GAID dataset

Architecture         Embedding (dimensionality)   Metric   Rank-1 [%]   Rank-5 [%]
CNN-M                PCA (1100)                   L1       93,22        98,06
CNN-M                PCA (1100)                   L2       92,79        98,06
CNN-M                PCA (600)                    L1       93,54        98,38
CNN-M                TPE (450)                    L2       94,51        98,70
CNN-M fusion         PCA (160)                    L1       93,97        98,06
CNN-M fusion         PCA (160)                    L2       94,40        98,06
CNN-M fusion         TPE (160)                    L1       94,07        98,27
CNN-M fusion         TPE (160)                    L2       95,04        98,06
VGG                  PCA (1024)                   L1       97,20        99,78
VGG                  PCA (1024)                   L2       96,34        99,67
VGG                  TPE (800)                    L1       97,52        99,89
VGG                  TPE (800)                    L2       96,55        99,78
CNN-M + SVM [90]     —                            L2       98,00        99,60
The experiments show that each successive architecture is more successful in gait recognition than the previous one. The small filter kernels of the VGG model manage to learn more informative features than the larger filters of the CNN-M model. Besides, training a low-dimensional TPE always increases the accuracy compared to the common nearest neighbour classifier with the L2 metric. Nevertheless, the L1 metric often turns out to be more successful. Despite the fact that TPE was trained by optimizing the Euclidean similarity, the best result is achieved by the NN classifier based on the L1 distance between the embeddings of the descriptors. It confirms that the Manhattan distance is more appropriate for the analysis of neural gait descriptors.
Although the accuracy of the proposed models does not exceed Castro's [90] Rank-1, the correct label is more often among the most probable ones.
In order to check how sensitive the proposed method is to small appearance changes, the recognition quality for different clothing and carrying conditions was evaluated separately. The results of such a separation are presented in Table 2.4 and show that the recognition in "normal" conditions is almost perfect, while the change of shoes and the presence of a backpack indeed make the identification more complex.
Table 2.4 — Quality of recognition on TUM-GAID dataset under different conditions

                          Normal               Backpack             Shoes
Method                    Rank-1    Rank-5     Rank-1    Rank-5     Rank-1    Rank-5
VGG, L1                   99,7%     100,0%     96,5%     99,7%      96,5%     100,0%
CNN + SVM [90], L2        99,7%     100,0%     97,1%     99,4%      97,1%     99,4%
The VGG-like architecture was chosen as the main one for the further experiments, as it had given the best results on the TUM database. Side-view CASIA recognition based on the same VGG architecture was evaluated as well; however, the accuracy on this dataset turned out to be much lower: 71,80% of correct answers using the L2 distance and 74,93% in the case of the L1 metric. It confirms that the CASIA database is a challenging benchmark, combining a small number of subjects with large condition variability.
After the networks were trained on each of the datasets, the second set of experiments was conducted to investigate the generality of the algorithm, i.e. experiments on transfer learning. TUM-GAID videos and the side view of CASIA look quite similar (see Fig. 2.8), so it was checked whether one algorithm can be used for both datasets. Despite the good results of the algorithms on the TUM-GAID dataset, it turns out that the accuracy of the same algorithm with the same parameters on the CASIA dataset is very low. The same happens in the opposite case, when the network trained on CASIA is used for extracting features from TUM videos. These experiments were made using the truncated VGG, the best of the considered architectures, and the NN classifier based on the L1 metric without any extra embedding. Table 2.5 shows the accuracy of such transfer classification. One can see that all the results on CASIA are still lower than on TUM-GAID and that transferring the network weights noticeably worsens the initial quality.

Figure 2.8 — Side-view frames from TUM and CASIA gait databases

Table 2.5 — The quality of transfer learning

                       Testing Set
Training Set      CASIA        TUM
CASIA             74,93%       67,41%
TUM               58,20%       97,20%
The third part of the experiments aimed to investigate whether a stable model can be constructed by combining the datasets rather than using only one of them. For this purpose, the neural network was trained on the union of the available data and evaluated on each of the testing subsets. Since all the subjects in the two databases are distinct, they can be united into one dataset and the network can be trained once on this union. The obtained model is supposed to learn the gait features of both collections and thus be more general than the previous models.
Because the two parts of the united dataset have different sizes, the input batches had to be balanced while training the network. Each training batch was composed of the same number of objects from both databases. Thus, each OF block from the "small" CASIA dataset was sampled more frequently than the blocks from TUM. The evaluation of the algorithm was made individually for each dataset, and the results are shown in Table 2.6. Despite the joint learning procedure and class balancing, the quality of such a united model turns out to be lower than the quality of the distinct algorithms for each dataset. It reflects the overfitting of the latter models, which was not prevented by dropout and regularization.
Table 2.6 — The quality of the jointly learned model

                                     TUM                     CASIA
Model                                Rank-1     Rank-5       Rank-1     Rank-5
VGG + NN, L1, balanced learning      96,45%     99,24%       72,06%     84,07%
VGG + NN, L2, balanced learning      96,02%     99,35%       70,75%     83,02%
2.4
Conclusions
The experiments show that optical flow contains enough information for gait recognition. Even if the shape of the human figure changes (for example, when a bag or backpack appears), the nature of the motion is preserved and the person can be recognized. However, the available datasets are quite small and not sufficient for training a general, stable algorithm. The algorithms based on deep neural networks overfit to the dataset conditions and can work well if all the settings are fixed, while a change of the recording conditions worsens the recognition quality. Nevertheless, even being imperfect, the transferred model still copes with unknown data and produces relatively accurate classification.
Starting from this baseline model, the approach is developed both conceptually, by investigating the motion in more detail, and technically, by considering new neural network architectures and training methods.
3. Pose-based gait recognition
This chapter focuses on the development of the baseline method. Continuing to use optical flow as the main source of information, it is suggested that the algorithm should consider the motion in more detail in order to capture more important gait characteristics.
3.1
Motivation
In this research, the problem of human identification in video is reduced to gait recognition. According to the medical encyclopedia [99], gait combines the biomotorics of the free limbs with the movements of the trunk and head, in which muscular coordination is regulated by the mechanisms of movement, posture maintenance and body balance. However, despite the formal medical definition, gait is the manner of walking, and some parts of the human body participate in the walking process more than others. Thus, the natural assumption is made that different body parts affect the gait representation differently and some of them are more informative than others. For example, the hands and head seem not very informative, as while walking a person can move his or her hands independently (waving to somebody, carrying a bag, speaking on the phone or just keeping the hands in pockets) or turn the head looking around (see Fig. 3.1 for visual examples). Thus, the features obtained from the motion of the hands and head are expected to be quite noisy and not useful for recognition. Since the legs are the limbs most involved in walking, the bottom part of the body is expected to carry more gait information and therefore to deserve more attention than the head or hands.
Figure 3.1 — Gait-independent motions of hands and head that are not informative
Consequently, it is proposed to estimate the human pose, define some key points to consider in more detail, and limit the analysis of the optical flow maps to the neighbourhoods of these key positions. In so doing, we do not stop considering the complete motion of the full body: a hierarchy of multi-scale areas of the human figure, from the full body to exact joints, is considered, and the motion of the points in these shrinking areas is observed.
Looking ahead, let us note that sometimes the full colour image sequences are not available and only silhouettes are provided (as, for example, in the OULP dataset [8] described in section 1.2). This can happen due to very bad lighting and low quality of the recordings, which makes the transitions between body parts and pieces of clothing invisible, or for legal reasons (personal data protection). In this case, the pose can hardly be estimated, but the optical flow can still be computed even for such frames. However, when applied to silhouette masks, the OF reflects the movement of the external edges of the human figure, which is visualized in Figure 3.2. The edges of the figure can also be considered as "parts of the body", so this case is investigated as well.
Figure 3.2 — The visualization of optical flow between two silhouette masks (two
channels are the components of OF and the third one is set to zero)
3.2
Proposed algorithm
In this section, the details of the proposed pose-based algorithm are described. This approach inherits some of the ideas of the baseline presented in section 2, including the general pipeline, which consists of data preprocessing based on optical flow, neural feature extraction, and further feature processing and classification. However, each of these steps evolves and undergoes changes. The pipeline of the developed model is presented in Figure 3.3. One can see that although optical flow maps remain the input data for the neural network feature calculator, the structure of this input has changed, which entails a change in the post-processing stage as well. Let us discuss the main steps of the approach in the following subsections.
Figure 3.3 — The pipeline of the pose-based gait recognition algorithm
3.2.1
Body part selection and data preprocessing
Similarly to the baseline model, the optical flow between consecutive video frames is calculated in advance to be used for feature training and extraction. The OF maps are stored as images as well; however, the structure of these images is changed. In the baseline model, two channels of the image correspond to the horizontal and vertical components of the optical flow field and the third one is set to zero. The modified version of these images contains the same first and second channels, but the third one carries the magnitude of the OF vector. Although this value correlates with the first two, this additional information helps the neural network to learn motion features.
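Packing the flow into such a three-channel image can be sketched in a few lines of NumPy (the function name is illustrative, and any value normalization needed for saving the result as an image file is omitted).

import numpy as np

def flow_to_image(flow):
    # flow has shape (H, W, 2); channel 2 of the output is the flow magnitude.
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    return np.dstack([flow[..., 0], flow[..., 1], mag])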
The main innovation of this approach is the detailed consideration of point movements in different areas of the human figure. Instead of the OF blocks, which are abandoned in this investigation, patches of optical flow around different body parts are fed into the network. The set of these body parts is selected for natural and empirical reasons. As explained in section 3.1, the legs are supposed to be the most informative part of the human figure; thus, two patches around the feet are chosen for recognition. However, these patches cover quite a small part of the figure, which is obviously not enough, so two "middle" areas are added to the set, the lower and upper body parts, as well as one "big" patch corresponding to the full body. The bounding boxes of these body parts are shown in Fig. 3.4. Such a hierarchical body part set allows the motion to be considered over the whole figure as well as in smaller areas, in order to mine more precise information. The idea of considering point motion maps in the neighbourhood of human body parts is novel compared to other gait recognition approaches based on optical flow.
Figure 3.4 — The bounding boxes for human body parts: right and left feet (small
blue boxes at the bottom), upper and lower body (green boxes in the middle) and
full body (big red box)
In order to construct the required patches, the pose key points have to be found. For this reason, the pose of the human is estimated in each frame by finding the joint locations with the OpenPose [89] algorithm. This method extends the convolutional pose machines [100] approach, based on applying a cascade of several predictors that refine each other's estimates. At each stage, there is a neural network with a small convolutional architecture predicting the heatmaps of joint locations. Such a step-by-step procedure increases the receptive field and allows the whole image as well as local features to be observed, leading to prediction refinement.
After the pose estimation stage, the frames in each video can be filtered according to the person's presence in the frame. Since exactly the human motion is considered, only frames containing the full figure are taken into account. So, sometimes several frames at the beginning and the end of the video (where the person is not in the frame entirely) are not used.
Having the key points found, the patches for network training are chosen in the following way. The leg patches are squares with the key foot points in their centers; the patch should cover the whole foot, including the heel and toes, so the side of the square is empirically chosen to be 25% of the human's height. The upper body patch contains all of the joints from the head to the hips (including the hands), and the lower body patch contains all of the joints from the hips to the feet (excluding the hands). Thus, each pair of consecutive frames produces five patches for use as network inputs (see the visualizations of these patches in Fig. 3.5). This procedure finishes the preprocessing step and the data preparation.

Figure 3.5 — The visualizations of optical flow maps in five areas of the human figure
3.2.2
Neural network pipeline
The fundamental component of the proposed algorithm involves the extraction
of neural features. In this section, the details of neural network architectures, training
process and feature extraction are provided.
Data augmentation
As the main difficulty of neural network based gait recognition is the lack of training samples, it is necessary to augment the data to decrease overfitting and make the algorithm more stable.
The data is augmented using conventional spatial transformations. In the training process, four random integers are uniformly sampled for each of the five considered body parts to construct the bounds of an input patch. The first two numbers are the left and right extensions chosen from the interval [0, w/3], where w is the width of the initial bounding box, while the latter two numbers are the upper and lower extensions in [0, h/3], where h is the height of the box. Two cropping methods are applied with equal probability: with or without zero padding. If the extended box goes beyond the frame borders, the patch can either be supplemented by zero values on the corresponding side, or cropped as is, which leads to patch stretching. The resulting bounded patches are cropped from the OF maps and then resized to 48 × 48.
This augmentation allows us to obtain patches containing body parts undergoing both spatial shifts and zoom. If the sum of the sampled numbers is close to zero, a large image of the body part with very little excess background is obtained (see the left example in Fig. 3.6); otherwise, the image is smaller with more background (the second left example). On the other hand, fixing the sums of the first and the second pairs of bounds produces body parts of the same size but with a changed location inside the patch (the two right examples in Fig. 3.6).
Figure 3.6 — A frame with the human figure bounding box (red) and the border of possible crops (green), and examples of patches that can be cropped from this single frame
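A possible NumPy/OpenCV sketch of this patch sampling is given below; the box format (x0, y0, x1, y1), the use of cv2.resize, and the function name are assumptions made for illustration.

import numpy as np
import cv2   # OpenCV is assumed to be available for resizing

def random_patch(of_map, box, rng, out_size=48):
    # Randomly extend a body-part bounding box and crop the patch from an OF map.
    H, W = of_map.shape[:2]
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    left, right = rng.integers(0, w // 3 + 1, size=2)     # extensions in [0, w/3]
    top, bottom = rng.integers(0, h // 3 + 1, size=2)     # extensions in [0, h/3]
    nx0, ny0, nx1, ny1 = x0 - left, y0 - top, x1 + right, y1 + bottom
    if rng.random() < 0.5:                                 # zero padding outside the frame
        patch = np.zeros((ny1 - ny0, nx1 - nx0, of_map.shape[2]), of_map.dtype)
        sy0, sy1 = max(ny0, 0), min(ny1, H)
        sx0, sx1 = max(nx0, 0), min(nx1, W)
        patch[sy0 - ny0:sy1 - ny0, sx0 - nx0:sx1 - nx0] = of_map[sy0:sy1, sx0:sx1]
    else:                                                  # clip to the frame (stretches the patch)
        patch = of_map[max(ny0, 0):min(ny1, H), max(nx0, 0):min(nx1, W)]
    return cv2.resize(patch, (out_size, out_size))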
Neural network architectures and training process
When the optical flow patches are prepared, they are used for neural network training. An important feature of the proposed model is that the same network with the same weights is used for each of the body parts. The patches are fed into the network interspersed with each other, and no stream separation is needed. Moreover, when applied to multi-view recognition, one general network is used for all the available views; no preliminary separation is made and no information about the viewing angles is required during training and inference. Unlike many other methods (e.g. based on GEI) that construct the features depending on the view, the proposed model takes all frames as input regardless of the angle.
Two convolutional architectures are considered and compared for this approach. The first one is the best of the baseline model architectures, the truncated VGG-like network; the details of this architecture can be found in section 2.2.2 (Table 2.2). Applying the same architecture, we can make an exact comparison of the new pose-based model with the baseline one and see whether considering body parts really improves the recognition. The staged training process described in section 2.2.2 remains the same, with step-by-step layer widening and extra weight regularization.
However, despite data augmentation, the dropout layer and weight regularization, the quality on the testing set still lags behind the training one, revealing overfitting. Thus, a second architecture, not used in the previous research, is considered. Unlike all the networks described earlier, which contain both convolutional and dense layers, a fully convolutional model, namely ResNet [101], is proposed for the task. This architecture contains residual connections, which allow the transfer of information from low to high levels, so that the addition of each new block increases the classification accuracy. Besides, the absence of fully connected layers in the ResNet architecture reduces the number of parameters of the model, which also decreases overfitting. Due to such a structure, ResNet has become one of the most successful architectures for the image classification task and, hence, a very popular and influential one in computer vision in general. Despite all the advantages, ResNet architectures are usually very deep, requiring a large number of layers for performance improvement, and each new block significantly increases the number of parameters and lengthens the training process. Another problem is exploding or vanishing gradients resulting from the very long paths from the last to the first layer during backpropagation. To avoid these problems, the Wide Residual Network [102] with decreased depth and increased residual block width is chosen; the resulting reduction in the number of layers made the training process much faster and allowed the network to be optimized with less regularization. The details of the considered Wide ResNet architecture are presented in Table 3.1.
Table 3.1 — Wide Residual Network architecture

B1:  3 × 3, 16 conv;  BatchNorm
B2:  [ BatchNorm;  3 × 3, 64 conv;   BatchNorm;  3 × 3, 64 conv  ] × 3
B3:  [ BatchNorm;  3 × 3, 128 conv;  BatchNorm;  3 × 3, 128 conv ] × 3
B4:  [ BatchNorm;  3 × 3, 256 conv;  BatchNorm;  3 × 3, 256 conv ] × 3
Each column in the table defines a set of convolution blocks with the same
number of filters. Similarly to VGG architectures, all of the convolutions have kernels
of size 3 × 3. The first layer has 16 filters and is followed by batch normalization
and then three residual blocks, each comprising two convolutional layers with 64
filters with normalization between them. This block construction is classical for
Wide ResNet architectures, and all of the blocks have similar structure. In each
successive group the blocks have twice as many filters as in the previous one. Each
first convolutional layer in groups B3–B4 has a stride with parameter 2; therefore,
the tensor size begins at 48 × 48 pixels and decreases to 24 × 24 and then 12 × 12
in groups B3 and B4, respectively.
Figure 3.7 — Construction of residual blocks in the Wide ResNet architecture
The detailed construction of residual blocks is shown in Fig. 3.7. The scheme
on the left corresponds to a common block without strides when the input and
output shapes coincide; that on the right defines the structure of the first block of
the group (columns B3 and B4) as the number of filters doubles and the size of the
map decreases. In this case, we add one auxiliary convolutional layer with a 1 × 1
kernel, stride, and doubled number of filters to make all of the shapes equal before
the summation. All the batch normalization layers are followed by ReLU activations.
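A rough PyTorch sketch of one such residual block is shown below (the original implementation used Lasagne/Theano); the layer ordering follows Table 3.1 and the description above, while padding and other details are assumptions.

import torch.nn as nn

class WideBlock(nn.Module):
    # One residual block: BatchNorm -> ReLU -> 3x3 conv -> BatchNorm -> ReLU -> 3x3 conv,
    # plus a shortcut; when the filter count doubles and the map shrinks, a strided
    # 1x1 convolution makes the shapes equal before the summation.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if (stride != 1 or in_ch != out_ch) else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return out + self.shortcut(x)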
In order to be classified, after going through all the residual blocks, the tensor has to be transformed into a 1-dimensional vector with the number of components equal to the number of subjects, representing the probability distribution over the subjects. For this purpose, a BatchNorm layer is added, followed by average pooling, which "flattens" the 12 × 12 maps obtained after the convolutions into one vector. After that, the last dense layer with softmax nonlinearity is applied.
Both networks were trained using Nesterov Momentum gradient descent with
the learning rate reduced from 0.1 by a factor of 10 each time the training quality
stopped increasing. Network training was implemented in Lasagne framework [103]
with a Theano backend on an NVIDIA GTX 1070 GPU.
3.2.3
Feature aggregation and classification
The network is trained using the augmented data and then used as a feature extractor, with the outputs of the last hidden layer (AvgPooling) used as gait descriptors. Although the model is trained to determine the subject identity from the patch of optical flow in any of the considered body areas, in the testing phase the features are extracted for each frame and for each body part and aggregated into one pose-based gait descriptor. Let us explain how this final descriptor is obtained. First of all, instead of sampling random bounds for each patch, their mean value is taken (w/6 and h/6, respectively) to locate the body part in the center of the patch. Then, considering j body parts, one can obtain j patches for each pair of consecutive frames and therefore (N − 1)j descriptors
for constructing one feature vector from all network outputs are investigated. The
first, a “naive” method, involves averaging all of the descriptors by calculating the
mean feature vector over all frames and all body parts. This approach is naive in
that the descriptors corresponding to different body parts have different natures;
even if we compute them using one network with the same weights, it would be
expected that averaging the vectors would mix the components into a disordered
result. Surprisingly, the accuracy achieved using this approach is very high, as will
be shown through comparison with another approach in the next section. The second
and more self-consistent approach is averaging the descriptors only over time. After
doing so, 𝑗 mean descriptors corresponding to each of 𝑗 body parts are obtained and
concatenated to produce one final high-dimensional feature vector.
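Both aggregation variants reduce to a few lines of NumPy, sketched below under the assumption that the per-frame, per-part descriptors are stored in an array of shape (N − 1, j, d).

import numpy as np

def aggregate(descriptors, method="concat"):
    descriptors = np.asarray(descriptors, dtype=float)   # (N - 1, j, d)
    per_part = descriptors.mean(axis=0)                  # average over time: (j, d)
    if method == "avg":                                  # "naive" variant: also average over parts
        return per_part.mean(axis=0)                     # (d,)
    return per_part.reshape(-1)                          # concatenation: (j * d,)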
The obtained gait descriptors are used for further classification. Similarly to the baseline model, the descriptors corresponding to the same person are assumed to be spatially close to each other; thus, the Nearest Neighbour classification method is applied. The descriptors are normalized by their Euclidean length and projected into a low-dimensional subspace using the PCA algorithm prior to fitting and classification, and for each testing object the closest descriptor from the gallery is sought. As in section 2, both L1 and L2 metrics were considered, and the first one produced higher recognition quality.
The details and the results of the experiments evaluating the described
approach are presented in the next section.
3.3
Experimental evaluation
3.3.1
Performance evaluation
The general protocol of the experiments coincides with the one described in section 2.3. Experiments were conducted on all three available databases: TUM-GAID [7], CASIA [9], and OULP [8]. The first one and the 90° subset of the second one are used for side-view recognition, while the full CASIA set and OULP are employed in the multi-view experiments. All the data of each collection is separated into two parts: one is used for feature extractor training and the other one remains for evaluation. In addition to the Rank-1 and Rank-5 metrics defined in the first chapter, the cumulative match characteristic (CMC) curve was plotted to compare different techniques more clearly.
In addition to the identification experiments, the verification quality of the different methods was compared. All the videos of the testing subjects were divided into training and testing parts in the same way as in the identification task. For each pair of training and testing videos, the algorithm returns the distance between the corresponding feature vectors and predicts, based on this distance, whether the same person appears in both video sequences. To evaluate the quality of the verification, we plotted the receiver operating characteristic (ROC) curves of false acceptance rates (FAR) and false rejection rates (FRR) and calculated the equal error rates (EERs).
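For reference, the EER can be approximated from the pairwise verification distances as in the following NumPy sketch; the threshold sweep over all observed distances is an assumption about the implementation.

import numpy as np

def equal_error_rate(dist_genuine, dist_impostor):
    # A pair is accepted when its distance is below a threshold: FAR is the share of
    # accepted impostor pairs, FRR the share of rejected genuine pairs.
    dist_genuine = np.asarray(dist_genuine, dtype=float)
    dist_impostor = np.asarray(dist_impostor, dtype=float)
    thresholds = np.unique(np.concatenate([dist_genuine, dist_impostor]))
    far = np.array([(dist_impostor <= t).mean() for t in thresholds])
    frr = np.array([(dist_genuine > t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0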
3.3.2
Experiments and results
The goal of the following experiments is to explore the influence of different
conditions on gait performance, including:
– network architecture and aggregation methods;
– body parts used for training and testing the algorithms;
– length of the captured walk;
– differences in training and testing datasets.
The results are compared with those of methods published in the few years preceding this approach, including the Fisher-vector-based model [49] and multimodal features (RGB frames, audio and depth) [104], which demonstrated state-of-the-art results in their time.
The first set of experiments aims at evaluating the method itself and comparing different technical approaches such as network architectures, aggregation methods, and similarity measures. To start with, the side-view recognition experiments were conducted on the TUM and CASIA collections.
All the algorithms were trained from scratch. The results of side-view recognition on the TUM-GAID dataset are shown in Table 3.2, which makes it obvious that the proposed approach is the most accurate, outperforming all the state-of-the-art methods [47; 90; 105; 106].
Table 3.2 — Recognition quality on TUM-GAID dataset

Architecture & Embedding        Aggregation   Metric   Rank-1 [%]   Rank-5 [%]
VGG + PCA (1000)                avg           L2       96,4         100,0
VGG + PCA (1000)                avg           L1       97,8         100,0
VGG + PCA (500)                 concat        L2       97,4         99,9
VGG + PCA (500)                 concat        L1       98,8         100,0
Wide ResNet + PCA (230)         avg           L2       98,3         99,9
Wide ResNet + PCA (230)         avg           L1       99,2         99,9
Wide ResNet + PCA (150)         concat        L2       98,8         99,8
Wide ResNet + PCA (150)         concat        L1       99,8         99,9
Baseline model, VGG             —             L1       97,5         99,9
CNN + SVM [90]                  —             L2       98,0         99,6
DMT [105]                       —             —        98,9         —
RSM [106]                       —             —        92,0         —
SVIM [47]                       —             —        84,7         —
DCS [104]                       —             —        99,2         —
H2M [104]                       —             —        99,2         —
PFM [49]                        —             —        99,2         99,5
In Table 3.2, "avg" denotes the naive approach to feature aggregation in which the mean descriptor is computed over all body parts, while "concat" denotes the concatenation of descriptors, which produces better results.
It is worth noting that the best method of its time, PFM [49], takes input frames of the initial size 640 × 480 without any resolution change, while the inputs of the proposed method are reduced, requiring less time and memory for processing. Nevertheless, the quality of the algorithm not only reached the level of PFM but even outperformed it. The Wide ResNet architecture shows the best recognition accuracy while having far fewer parameters than the VGG-like one; thus, this architecture is considered more appropriate for the gait recognition task and is taken as the main one in the further experiments.
Since appearance and shape should not affect the gait representation, the recognition quality under the different conditions presented in the dataset is evaluated as well. The results in Table 3.3 confirm that the proposed approach is robust to variations of shoes and carrying conditions, and the model accuracy remains quite high under small changes of silhouette and appearance.
Table 3.3 — Recognition quality on TUM-GAID dataset under different conditions

Method                   Normal     Backpack    Shoes      Avg
Pose-based model         100,0%     100,0%      99,4%      99,8%
Baseline model, VGG      99,7%      96,5%       96,5%      97,5%
CNN + SVM [90]           99,7%      97,1%       97,1%      98,0%
DCS [104]                99,7%      99,0%       99,0%      99,2%
H2M [104]                99,4%      100,0%      98,1%      99,2%
The side-view recognition investigations continue with the CASIA 90° view experiments. To confirm the superiority of the pose-based method over the block-based one, these two methods are compared, and the results are shown in Table 3.4. Despite the small size of the training set, insufficient for constructing a stable VGG-like baseline model, the novel approach with the lightweight Wide ResNet backbone achieves quite high recognition accuracy on the testing part of the collection and shows less overfitting.
Table 3.4 — Recognition accuracy on CASIA lateral part

Architecture                 Aggregation   Metric   Rank-1 [%]
Wide ResNet + PCA (150)      avg           L2       85,1
Wide ResNet + PCA (130)      avg           L1       86,7
Wide ResNet + PCA (170)      concat        L2       84,8
Wide ResNet + PCA (170)      concat        L1       93,0
Baseline model, VGG          —             L1       74,9
Similarly to Table 3.3, Table 3.5 shows how the different clothing and carrying conditions presented in the CASIA database influence the recognition quality. The data variability in this collection is more complex than in TUM-GAID, as a coat changes the silhouette and hides the joints more than coating shoes do; thus, the decomposition of the results on this dataset into components with different conditions is more informative. One can see that while the presence of a bag does not worsen the recognition, putting on a coat actually does. Such a situation is predictable, as a coat is not just another piece of clothing of a different style. Unlike conventional indoor clothes, a coat is outerwear which hides the human figure, changes its shape, and makes the motion of many points of the body invisible, which leads to a change of the inner gait representation. Thus, such a change of dress is a really challenging factor complicating gait recognition. However, the accuracy decrease is not dramatic, and even the in-coat part-based recognition outperforms the average baseline results.
Table 3.5 — Recognition quality on CASIA lateral part under different conditions

Method                  Normal     Bag        Coat      Avg
Wide ResNet, L1         100,0%     100,0%     78,8%     93,0%
Baseline model, VGG     94,5%      78,6%      51,6%     74,9%
To investigate the general applicability of the algorithm, its multi-view gait recognition capability has to be verified. For this purpose, one more experiment has been conducted on the CASIA database. Unlike all the previous experiments, this one covers not only the lateral-view part of the collection but the full dataset, including 11 different viewing angles. Several multi-view testing protocols are suggested for this dataset; the one proposed in [60] was chosen here. Following this protocol, the network is trained on all the available data for the first 24 subjects (10 videos per subject shot at all 11 angles) and the evaluation is made on the remaining 100 subjects. Table 3.6 shows the average cross-view recognition rates at three probe angles: 54, 90, and 126 degrees. The gallery angles are the remaining 10 angles (excluding the corresponding probe one). We compare our results with [60], which showed the state-of-the-art result on CASIA, and with SPAE [64], where one uniform model is trained for gait data with different viewing and carrying condition variations. For two of the three considered angles, the pose-based method achieves the highest accuracy, and it also outperforms the other methods on average.
Table 3.6 — Average recognition rates for three angles on CASIA database

                                        Average Rank-1 [%] per probe view
Model                                   54°      90°      126°     Mean
WideResNet + PCA (230), concat, L1      77,8     68,8     74,7     73,8
SPAE [64]                               63,3     62,1     66,3     63,9
Wu [60]                                 77,8     64,9     76,1     72,9

To complete the evaluation of multi-view recognizability, experiments on the OULP collection [8] have been conducted. In addition to verifying the applicability of the proposed approach to multi-view data, these experiments aim at investigating the possibility of recognition in the case of partial information. Since OULP is distributed in the form of binary silhouette masks, only the movements of the external edges are computable, and it is worth exploring whether such information is enough for human identification.
Two testing protocols are proposed for this gait dataset. The first one uses a subset of 1912 subjects and 5 different splits of this subset in half, provided by the authors. The final recognition metric is obtained by averaging the results over these splits. The quality is evaluated for each pair of probe and gallery angles, but for brevity 85° is usually presented as the gallery view, while the probe views comprise all the other angles. The second protocol is based on five-fold cross-validation (cv) on all the available 3,844 subjects shot by two cameras. In this protocol, the results are usually aggregated over the angular difference between the probe and gallery views. In both cases, the final classifier is fitted on the gallery frames from the first walk and tested on the probe frames from the second walk. Since different researchers conduct their experiments following different protocols, both of them were used to make the investigation thorough.
First, the proposed method was compared with several successful deep and non-deep approaches: the view transformation models wQVTM [43] and TCM+ [44], the methods LDA [38] and MvDA [107] based on discriminant analysis, Bayesian approaches [36; 56], and GEINet [59]. To make the comparison complete, the quality of all methods was evaluated not only for the conventional classification task, but also for verification. Table 3.7 presents the Rank-1 and EER metrics, while the graphical comparison is shown in Fig. 3.8. The curves for all the methods except the proposed one were provided by their authors.
Table 3.7 — Comparison of Rank-1 and EER on OULP dataset

                                      Rank-1 [%], probe view      EER [%], probe view
Model                                 55°      65°      75°       55°     65°     75°
Wide ResNet, L1                       92,1     96,5     97,8      0,9     0,8     0,8
Wide ResNet, L2,
  extended training set               95,9     97,8     98,5      0,6     0,5     0,5
GEINet [59]                           81,4     91,2     94,6      2,4     1,6     1,2
LDA [38]                              87,5     96,2     97,5      6,1     3,1     2,3
MvDA [107]                            88,0     96,0     97,0      6,1     4,6     4,0
wQVTM [43]                            51,1     68,5     79,0      6,5     4,9     3,7
TCM+ [44]                             53,7     73,0     79,4      5,5     4,4     3,7
DeepGait + Joint Bayesian [56]        89,3     96,4     98,3      1,6     0,9     0,9
Joint Bayesian [36]                   94,9     97,6     98,6      2,2     1,6     1,3
The second row of Table 3.7 shows the quality of the model trained on all available subjects except the 956 testing ones. Such an extended training set, containing 2,888 subjects, allowed the recognition to be improved and the state-of-the-art method to be outperformed for some view angles. Despite being very high, the results achieved using this extended set are not highlighted, as they were obtained following a different training protocol.
Although the recognition accuracy of the proposed method is a bit lower than that of the best one [36], the quality of verification turns out to be higher. The optical-flow-based method has the lowest EER, and the corresponding ROC curve lies below all the other curves.
To compare the algorithm with two more state-of-the-art cross-view recognition methods [60; 61], the second evaluation protocol was used. The accuracy comparison with these two approaches is presented in Table 3.8. The cross-validation results confirm the ability of the optical flow based method to recognize people at different viewing angles. The accuracy of the algorithm is a bit lower than that of [61], but outperforms [60]. Besides, as mentioned above, unlike these methods, the proposed one does not require information on whether the frames were shot at the same or different viewing angles.
Thus, the experimental results reveal that the algorithm generalizes to the multi-view setting and achieves rather high accuracy even in the case of partial information.
Figure 3.8 — ROC (the first row) and CMC (the second row) curves under 85°
gallery view and 55°, 65°, and 75° probe view, respectively
Table 3.8 — Comparison of Rank-1 on OULP dataset following the five-fold cv protocol

                         Rank-1 [%] per angular difference
Model                    0°       10°      20°      30°      Mean
Wide ResNet, L1          98,4     98,2     97,1     94,1     97,0
Takemura [61]            99,2     99,2     98,6     97,0     98,8
Wu [60]                  98,9     95,5     92,4     85,3     94,3
LDA [38]                 97,8     97,1     93,4     82,9     94,6
These results show that recognition remains possible even in the absence of any texture
inside the human figure, and they complete the first set of investigations.
The second part of the experimental process concerned the human figure itself,
namely the body parts that are most important for gait recognition. As mentioned in
section 3.1, different parts of the body are supposed to affect the gait representation
differently, and the lower part seems more important for recognition. Nevertheless,
the results on the OULP dataset revealed that full-body features are quite informative;
therefore, different sets of body parts were compared. In addition to the full set of
five parts introduced in section 3.2.1, each separate body part was considered (full
body, the pair of leg patches, upper body, and lower body) as well as a set of “big
and middle” parts: full body, upper body, and lower body. The latter set was chosen
in order to check whether the precise consideration of small leg areas is really necessary
or whether big patches are enough.
For each of these sets, a new model was trained on the corresponding body
parts and then evaluated using concatenation as the pose aggregation method and the
𝐿1 metric as the distance measure. The comparison of these models on the TUM-GAID
dataset is presented in Table 3.9.
Table 3.9 — Recognition quality of models based on different sets of body parts on
TUM-GAID dataset

Body parts                                  Rank-1 [%]   Rank-5 [%]
legs                                           79,7         86,5
upper body                                     96,2         99,7
lower body                                     96,3         99,6
full body                                      98,9        100,0
full body, upper body, lower body              99,4        100,0
full body, upper body, lower body, legs        99,8         99,9
According to the top rows of the table, the larger the considered area is, the
higher the obtained accuracy. The full-body model is the most successful among the
“one-part” models, confirming that the corresponding patch is indeed the most
important one. However, this model is not perfect and makes a mistake in more than
1% of cases. Even the composition of the three “big and middle” patches (full body,
upper body, and lower body) does not close the gap completely, despite outperforming
each of the three single-part models. The best result is obtained by the five-part model,
leading to the conclusion that the precise consideration of the leg surroundings is not
excessive and improves the classification.
The third avenue of interest was determining the length of video needed for
good gait recognition. In all of the preceding experiments, the entire video sequence
was used, totaling up to 130 frames. Tables 3.10 and 3.11 list the results of testing the
algorithms on shortened sections of TUM-GAID and CASIA-B sequences (following
both multi-view and side-view protocols). This experiment was not conducted on the
OULP dataset because the period of constant view in its videos is about 30 frames,
which is already quite short; further shortening was therefore considered pointless.
Since the network is trained on distinct maps without any aggregation, it does
not depend on the length of the sequences, and the same model can be used in this
experiment. The only change is made at test time: instead of the entire sequence,
only its first 𝑛 frames are used for feature calculation. We consider 𝑛 = 50, 60, 70
for the TUM-GAID dataset and 𝑛 = 50, 70, 90, 110 for the CASIA-B database, as all
the sequences there are longer. Because all the frames are preliminarily filtered to
contain a fully visible human figure and the considered sections are still longer than
one gait cycle, the starting and finishing positions are not important and do not
influence the recognition.
Table 3.10 — Recognition quality for different video lengths on TUM-GAID dataset

Length of video    Rank-1 [%]   Rank-5 [%]
50 frames             94,3         94,9
60 frames             97,5         97,8
70 frames             99,4         99,5
full length           99,8         99,9
Although the length of the gait cycle is about 1 s, or 30–35 frames, such short
sequences turn out to be insufficient for good individual recognition.
The experiments conducted on two datasets verify that increasing the number of
consecutive frames for classification improves the results. The expanded time of
analysis is required because body point motions are similar but not identical for
each step and, correspondingly, using long sequences makes recognition more stable
to small inter-step changes in walking style. It is worth noting that video sequences
may consist of a non-integer number of gait cycles but still be accurately recognized.
Similarly to the baseline model, in a final set of experiments the stability and
transferability of the proposed algorithm were examined. To check whether the model
generalizes, it was trained on one dataset and evaluated on the other one without any
fine-tuning. Since the OULP collection differs substantially from the two coloured
datasets, the algorithms are obviously not transferable between them; thus, this
database was not used in this experiment. Besides this, only side-view recognition
can be investigated, because there are no other angles in the TUM database.
Nevertheless, even such restricted model transferability was considered, and the results
of the cross-dataset evaluation are shown in Table 3.12. The recognition quality
deteriorated following the dataset change, but not dramatically.
Table 3.11 — Recognition quality for different video lengths on CASIA-B dataset

                     Cross-view, Average Rank-1 [%]      Side-view, Rank-1 [%]
                              Probe view
Video length          54°      90°     126°     Mean            90°
50 frames            66,85    60,95    64,05    63,95          90,33
70 frames            67,05    62,70    66,50    65,42          91,38
90 frames            67,10    66,95    69,95    68,00          92,42
110 frames           75,25    68,45    73,20    72,30          92,68
full length          77,80    68,80    74,70    73,77          92,95
Table 3.12 — The quality of transfer learning

                        Testing Set
Training Set         CASIA       TUM
CASIA                92,7%      66,5%
TUM                  76,8%      99,8%
Note that the accuracy of the transferred pose-based algorithm on the CASIA database
is higher than the accuracy of the directly applied baseline algorithm (compare with
Table 2.5). Nevertheless, because of the change of training dataset, the recognition
accuracy decreases by 17% for the CASIA dataset and by about 33% for TUM. This
suggests that the algorithm overfits and that the amount of training data (particularly
in the CASIA dataset) is not sufficient for constructing a general algorithm, despite
the smaller size of the network.
3.4 Conclusions
The conducted research and the results of the experiments lead to the following
conclusions. First of all, the high recognition quality of the model based on distinct
full-body patches reveals that optical flow itself contains enough information for
recognition by motion, and long temporal blocks are excessive. This allows making the
model much lighter and the recognition faster. Secondly, the most important information
for gait recognition is the movement of the external edges, reflected in the optical
flow computed from silhouette sequences. Thus, the proposed method can be applied
even in the case of partial information. Nevertheless, collecting additional information
from the regions around the joints improves the results, surpassing the state-of-the-art
approaches.
4. View resistant gait recognition
According to previous research, single-view recognition turns out to be much
easier than cross-view recognition. The quality of identification is higher when the
viewing angles are close to each other, which leads to the need to modify the models
and make them more stable to view changes. The current chapter is devoted to methods
of increasing view resistance: two approaches are proposed for this purpose, analyzed,
and evaluated experimentally.
4.1 Motivation
View variation is one of the main difficulties in the gait recognition problem. This
factor changes only the inner gait representation: it is obvious to the human eye that
the same person is captured under different angles (see Fig. 4.1); nevertheless, the
computer gets two completely different video sequences, and it is a complex challenge
to build a model that understands that these sequences correspond to the same person.
Even when all the other conditions remain the same (background and environment,
lighting, camera settings, clothes, shoes, and carried objects), the change of viewpoint
leads to different features, and the greater the difference between the views, the further
the gait descriptors are from each other, making the classification more complex.
Considering the Nearest Neighbour method
Figure 4.1 — The example of the same walk captured under different viewing angles
as the final classifier of gait descriptors, one needs to make the descriptors of the
same subject as close to each other as possible regardless of the viewing angle the
video has been captured under. At the same time, the vectors corresponding to
different subjects should be far from each other, so that spatially close vectors have
the same labels. Despite the mismatch of formal definitions, such a problem
formulation is quite close to clustering. Indeed, many techniques (e.g. Linear
Discriminant Analysis [38]) aim at making the features of the same subjects closer
while pushing the features of different subjects further from each other.
Figure 4.2 — The desired cluster structure of the data
According to this reasoning, the features trained by the neural network should
be given a cluster structure, as shown in Figure 4.2. In this chapter, two techniques
are proposed for this purpose. The first is the View Loss, which shapes the structure
of the descriptors during neural network training by penalizing view dependence; the
second is the cross-view triplet probability embedding (CV TPE), inspired by the
TPE [98] approach introduced in section 2.2.3, which modifies the features after
extraction from the network and also decreases the view dependency. Although both
methods are model-free and can be applied to any algorithm, here they are integrated
into the developed pose-based gait recognition method. The pipeline of the algorithm
with these two modifications is shown in Figure 4.3. All the stages of the algorithm
except the two novelties remain the same as described in chapter 3. The details of
both approaches are given in the following sections.
Figure 4.3 — The pipeline of gait recognition algorithm with two modifications
(view loss and CV TPE) shown in green boxes
4.2 View Loss
To overcome the problem of multi-view recognition, one needs to decrease the
dependence of the gait features on the view. Thus, the main goal is to train the model
so that the features of the same person are similar for different views. For this purpose,
a penalty is proposed for features of one subject that are too different from each other.
This penalty is implemented as a special loss function added during network training.
Each of the neural networks introduced in the previous sections was trained for
the classification task to predict the probability distribution over the set of training
subjects. The conventional classification loss function is cross-entropy, or logarithmic
loss (LogLoss), defined as
\[
LogLoss(p, y) = -\sum_{i=1}^{n} y_i \log(p_i),
\]
where 𝑝 is the predicted distribution vector and 𝑦 is the one-hot encoded ground truth.
In addition to this loss, an auxiliary View Loss is proposed to prevent large distances
between features of the same subject obtained under different views. To make the
vectors close to each other regardless of the viewing angle, it penalizes the difference
between the representation of videos with one certain view and the averaged
representation of videos with different views. Thus, instead of the direct optimization
\[
\rho(d^{\alpha}, d^{\beta}) \to \min \quad \forall \alpha, \beta,
\]
where d^α, d^β are descriptors of videos with views α and β, respectively, and ρ is any
distance metric, the following task is considered:
\[
\rho\Big(\frac{1}{m}\sum_{j=1}^{m} d^{\alpha}_{j},\ \frac{1}{m}\sum_{j=1}^{m} d^{\beta_j}_{j}\Big) \to \min \quad \forall \alpha, \beta_j, \qquad (4.1)
\]
where all the descriptors d^α_j have the same view α, while the vectors d^{β_j}_j may
be obtained under different viewing angles β_j.
In detail, let 𝑖 be the ID of a subject and α be an angle. Let us consider two data
batches: the first one, {b_{i,α}}, consists of the data for subject 𝑖 with view α, and the
second one, {b_i}, consists of data for the same subject 𝑖 but captured under any angle.
Let d_{i,α} be the average of the last hidden layer outputs over the first batch, and d_i
be the average hidden output over the second batch.
Since the hidden representations are used as the gait features and the nearest one
is searched for in the database, we need to bring d_{i,α} and d_i closer. To achieve this,
we consider such pairs of batches and the following loss function for them:
\[
L_{reg} = \lambda L_{view} + L_{clf}, \qquad (4.2)
\]
where L_view = ||d_{i,α} − d_i||² is the described view loss corresponding to optimization
task (4.1), L_clf = LogLoss({b_{i,α}}) + LogLoss({b_i}) is the conventional cross-entropy
for classification applied to both batches, and λ is the relative weight of the view term
in the total loss. The view term added to the conventional loss function can be
considered as a regularizer that prevents view memorization. It is worth noticing that
different network outputs are used in these two loss terms: while the intermediate
(hidden) representations are brought together in L_view, the final outputs reflecting the
probability distribution are used in L_clf. Such a strategy aims to strengthen the
feature-extractor nature of the network.
Since such batches have to be sampled in a special way (the subject ID is fixed
inside the batch, and in some batches the angle is fixed as well), the network cannot be
trained exclusively on such batches. Therefore, the conventional optimization steps
based on random batches are alternated with these regularizing steps: the network
makes a “view optimization” step once every 𝑘 steps. The whole algorithm for one
epoch is as follows:
for j in range(epoch_size) do
    Sample a random batch;
    Make an optimization step: L_clf → min;
    if j mod k = 0 then
        Sample a random subject i and a random view α;
        Sample batches {b_{i,α}}, {b_i};
        Make a regularized optimization step: L_reg → min;
    end if
end for
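For illustration, below is a minimal PyTorch sketch of one regularized optimization step; the batch layout and the names `backbone` and `classifier` are hypothetical placeholders for the actual network parts, while the loss structure follows equation (4.2).

```python
import torch
import torch.nn.functional as F

def regularized_step(backbone, classifier, optimizer, batch_view, batch_any, lam=1e-3):
    """One L_reg step: batch_view holds maps of subject i under a fixed view,
    batch_any holds maps of the same subject under arbitrary views."""
    x_view, y_view = batch_view            # input tensors and integer labels
    x_any, y_any = batch_any

    h_view = backbone(x_view)              # last hidden layer outputs
    h_any = backbone(x_any)

    # classification term: cross-entropy on both batches (final outputs)
    l_clf = F.cross_entropy(classifier(h_view), y_view) + \
            F.cross_entropy(classifier(h_any), y_any)

    # view term: squared distance between the averaged hidden descriptors
    d_view = h_view.mean(dim=0)
    d_any = h_any.mean(dim=0)
    l_view = torch.sum((d_view - d_any) ** 2)

    loss = lam * l_view + l_clf            # L_reg = λ L_view + L_clf
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```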
Such regularized optimization steps obviously increase the unregularized loss but
help the optimization escape local minima. However, the network first has to be
trained in a conventional way and reach some “optimal point”. Thus, the following
training process has been implemented:
1. Unregularized training, until the LogLoss stops decreasing;
2. Regularized training with λ = 10⁻³ and 𝑘 = 20, until both the classification and
view losses stop changing;
3. Unregularized training, until the LogLoss stops decreasing.
The first step leads to the initial model described in chapter 3. However, when the
loss stops decreasing, adding the regularizer pulls the solution out of the local minimum
if it is not optimal, helping to change the optimization direction and descend towards
a better minimum. The behaviour of the logarithmic loss during the training process is
plotted in Figure 4.4. As supposed, the model got into a local minimum after the first
step of training, and the second, regularized, step pulled it out of this minimum.
Despite the increase of the LogLoss due to the extra loss term, the third optimization
step leads to the lowest value, which was unreachable without regularization. It is
hardly noticeable on the graph, but the loss after the third step is 0,18, which is about
13% lower than the 0,21 reached after the first step.
Figure 4.4 — The curve of the training loss during three steps of training process
The decrease of the training loss gives hope that the quality on the test set
improves as well. This assumption is confirmed in section 4.4.
4.3 Cross-view triplet probability embedding
Although the training process described in the previous section aims to get rid
of view dependence, one more approach, based on TPE [98], is proposed to increase
view resistance further. Despite the fact that TPE was initially introduced for the
verification task, it can successfully be applied to classification, as shown in
section 2.2.3. The details of TPE are explained in chapter 2: it is motivated by
inequality (2.2) and reduced to the conventional optimization problem (2.5). However,
TPE was introduced without regard to multi-view recognition and does not take angles
into consideration. Here, its modification is proposed that uses the information about
the view during training and brings the features of the same subject under different
angles closer to each other.
Figure 4.5 — The desired mutual arrangement of the elements in a triplet
In the multi-view mode, the features of the same object captured under different
views have to be closer to each other than the features of different objects captured
under the same angle; thus, condition (2.2) turns into the following one:
\[
S_W(v_{a,\alpha}, v_{a,\beta}) > S_W(v_{a,\alpha}, v_{n,\alpha}), \qquad (4.3)
\]
where v_{a,α} and v_{a,β} are the features of the same object a captured under different
views (α ≠ β), and v_{n,α} corresponds to another object captured under the same view
α as v_{a,α}. The visualization of this condition is presented in Figure 4.5. Accordingly,
the optimization problem (2.5) transforms into a new one:
\[
\sum_{v_{a,\alpha},\, v_{a,\beta},\, v_{n,\alpha}} \log \frac{e^{-\|W v_{a,\alpha} - W v_{a,\beta}\|^2}}{e^{-\|W v_{a,\alpha} - W v_{a,\beta}\|^2} + e^{-\|W v_{a,\alpha} - W v_{n,\alpha}\|^2}} \to \max_{W}. \qquad (4.4)
\]
Similarly to the initial embedding, the parameter matrix W is found by the SGD
method. Let us denote
\[
u = v_{a,\alpha} - v_{a,\beta}; \qquad v = v_{a,\alpha} - v_{n,\alpha}. \qquad (4.5)
\]
Since
\[
\frac{d}{dW}\|Wx\|^2 = 2Wxx^T,
\]
the gradient obtained from one triplet v_{a,α}, v_{a,β}, v_{n,α} equals
\[
\nabla_W = \frac{d}{dW}\Big( -\|W v_{a,\alpha} - W v_{a,\beta}\|^2 - \log\big( e^{-\|W v_{a,\alpha} - W v_{a,\beta}\|^2} + e^{-\|W v_{a,\alpha} - W v_{n,\alpha}\|^2} \big) \Big) =
\]
\[
= -2Wuu^T - \frac{1}{e^{-\|Wu\|^2} + e^{-\|Wv\|^2}} \Big( e^{-\|Wu\|^2}(-2Wuu^T) + e^{-\|Wv\|^2}(-2Wvv^T) \Big) =
\]
\[
= \frac{e^{-\|Wv\|^2}}{e^{-\|Wu\|^2} + e^{-\|Wv\|^2}} \big( 2Wvv^T - 2Wuu^T \big). \qquad (4.6)
\]
Thus, at each optimization step, the matrix W is updated according to the
following rule:
\[
W = W + 2 \sum_{v_{a,\alpha},\, v_{a,\beta},\, v_{n,\alpha}} \frac{e^{-\|Wv\|^2}}{e^{-\|Wu\|^2} + e^{-\|Wv\|^2}} \cdot W\big(vv^T - uu^T\big), \qquad (4.7)
\]
where u and v are defined for each triplet v_{a,α}, v_{a,β}, v_{n,α} according to
equations (4.5). Being embedded into the constructed subspace, the representations
become less view-dependent and more identity-dependent: the features of the same
subject get closer even if they are obtained under different angles.
To make the training more effective, hard negative mining has been implemented.
At each step, the triplet is sampled as follows:

Sample a random subject ID a and two random angles α, β;
Sample the anchor and positive elements v_{a,α}, v_{a,β};
d_min = inf;
for j in range(negatives_count) do
    Sample a random subject ID n among all subjects except a;
    Sample a negative candidate v̂_{n,α} and calculate d_{a,n} = ||W v_{a,α} − W v̂_{n,α}||;
    if d_{a,n} < d_min then
        d_min = d_{a,n};
        v_{n,α} = v̂_{n,α};
    end if
end for
return (v_{a,α}, v_{a,β}, v_{n,α}, d_min)

where inf should be any large number (it is set to 10⁴) and negatives_count is the
number of negative candidates (set to 200). Thus, the closest negative object in the
current subspace is selected, which improves the trainability of the embedding and
speeds up the process.
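As an illustration, here is a minimal NumPy sketch of one CV TPE update combining the sampling procedure above with the update rule (4.7); the data layout (a dictionary of descriptors indexed by subject and view), the function name, and the explicit learning rate are assumptions, not part of the original algorithm.

```python
import numpy as np

def cvtpe_step(W, feats, rng, negatives_count=200, lr=1e-3):
    """One SGD step of the cross-view TPE.

    feats[(subject, view)] is a list of descriptor vectors (np.ndarray);
    W is the embedding matrix being trained; rng = np.random.default_rng().
    """
    subjects = sorted({s for s, _ in feats})
    views = sorted({v for _, v in feats})

    # anchor/positive: same subject, two different views
    a = rng.choice(subjects)
    alpha, beta = rng.choice(views, size=2, replace=False)
    v_a_alpha = feats[(a, alpha)][rng.integers(len(feats[(a, alpha)]))]
    v_a_beta = feats[(a, beta)][rng.integers(len(feats[(a, beta)]))]

    # hard negative mining: closest other subject under the same view alpha
    d_min, v_n_alpha = np.inf, None
    for _ in range(negatives_count):
        n = rng.choice([s for s in subjects if s != a])
        cand = feats[(n, alpha)][rng.integers(len(feats[(n, alpha)]))]
        d = np.linalg.norm(W @ v_a_alpha - W @ cand)
        if d < d_min:
            d_min, v_n_alpha = d, cand

    # update rule (4.7)
    u = v_a_alpha - v_a_beta
    v = v_a_alpha - v_n_alpha
    e_u = np.exp(-np.linalg.norm(W @ u) ** 2)
    e_v = np.exp(-np.linalg.norm(W @ v) ** 2)
    weight = e_v / (e_u + e_v)
    W += 2 * lr * weight * (W @ np.outer(v, v) - W @ np.outer(u, u))
    return W
```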
4.4 Experimental evaluation
To evaluate the influence of both developed approaches, separate experiments
were conducted for each of them, and after that the models were combined to check
whether View Loss and CV TPE are interchangeable or complementary. Since this
research concerns multi-view recognition, the TUM database is not appropriate for
this purpose; thus, only the CASIA dataset was used for the experiments.
In the first experiment, the influence of the view loss was investigated. The
network was trained in three stages, as described in section 4.2. After each of the
steps, validation was conducted to evaluate the recognition quality on the test set.
Following the same protocol as for the part-based model (see section 3.3.2), the
aggregated accuracies were calculated for each probe angle. To compare the proposed
approach with state-of-the-art models more thoroughly, the frontal view (0°)
recognition results are presented in addition to the 54°, 90°, and 126° angles considered
earlier, since this view is one of the most challenging, looking particularly different
from the others. The numerical results of the evaluation are shown in the first four
rows of Table 4.1. Despite the fact that the LogLoss on the training data increased
after the second step, the accuracy on the testing part after this step is higher for most
of the angles. During the last step, the loss on the training set decreased and achieved
its smallest value, and the final quality on the testing data is also the highest after
this step. As noticed in section 2.2.3, feature normalization prior to kNN classifier
application is usually very effective, which is confirmed by the third and the fourth
rows of the table.
The cross-view triplet probability embedding is trained separately from the
network, based on any set of feature vectors. The embeddings were trained for two
models, the initial pose-based model from chapter 3 and the view-loss-based one, to
check whether both approaches are worth applying or one of them is enough for
sufficient recognition quality. Besides this, the conventional TPE was trained to
evaluate the influence of the “cross-view” component. In all cases, the embeddings
were trained on the neural features of the 24 training subjects (the same as for neural
network training) and then applied to the test set.
The comparison of all the models with the state-of-the-art approaches is also
presented in Table 4.1. It demonstrates the superiority of the combination of the
developed approaches over their separate usage. One can see that training
regularization by the view loss increases the average accuracy by 5,4 percentage points
(from 71,6% to 77%), and the additional cross-view embedding increases the quality
by another 3 percentage points. The result turns out to be almost 2 percentage points
higher than the accuracy of the state-of-the-art GaitSet approach [63]. This confirms
that the two proposed concepts complement each other, allowing the model to
outperform other existing methods.
Note that there are two different rows corresponding to the baseline part-based
model and to the view-loss model after step 1. Although these two models were
obtained by the same procedure, the latter (as well as all the models described in this
chapter) was implemented in the PyTorch neural network framework [108] and achieved
higher quality compared with the first model, implemented in Theano.
4.5 Conclusions
In this chapter, two novel approaches that increase the view stability of gait
recognition were presented. Both of these concepts are model-free and can be
integrated into any neural network multi-view recognition model: the view loss can be
added to the loss function of any neural architecture, and the embedding is applicable
to any model prior to feature classification. Applied to the developed pose-based
model, each of the modifications improves the cluster structure of the data and thereby
increases the recognition quality. Moreover, the joint application of the two approaches
improves the model even more, allowing it to achieve state-of-the-art quality.
Table 4.1 — Comparison of average cross-view recognition accuracy of proposed
approaches and state-of-the-art models

                                               Average accuracy [%], per probe angle
Method                                           0°      54°     90°     126°
View loss, step 1                               59,7    80,1    68,9    77,7
View loss, step 2                               53,4    79,8    70,3    81,4
View loss, step 3                               64,4    81,8    72,6    81,1
View loss, step 3 + normalization               65,9    83,6    74,6    83,7
Part-based model + TPE                          56,9    82,6    73,5    82,4
Part-based model + CV TPE                       62,6    84,6    75,6    84,4
View loss, step 3 + CV TPE + normalization      69,3    86,3    75,8    86,0
SPAE [64]                                         -     63,3    62,1    66,3
Wu [60]                                         54,8    77,8    64,9    76,1
GaitSet [63]                                    64,6    86,5    75,5    86,0
5. Event-based gait recognition
One of the main advantages of gait as an identifier is that it can be captured at
a large distance and does not require high resolution or perfect precision. Particular
characteristics of appearance are not as important as the movements of the points. This
feature makes gait recognizable not only in conventional videos with coloured frames
changing 25–30 times per second but also in other data formats reflecting the motion
of points. One such data format is the event stream obtained from a Dynamic Vision
Sensor (DVS) [109]. These sensors have gained popularity recently and often replace
conventional cameras; thus, the recognizability of gait in the event stream is worth
investigating. This chapter focuses on the possibility of applying the developed gait
recognition methods to event-based data and is structured as follows. In the first
section, the description and important characteristics of DVS are presented; they are
followed by sections devoted to event stream processing and algorithm modifications,
and at the end of the chapter the experimental evaluations are presented to verify the
recognizability of such data.
5.1 Dynamic Vision Sensors
First, let us describe what a DVS is and why it is increasingly being used instead
of conventional cameras. These sensors are inspired by the natural human retina: they
similarly capture changes in light intensity on a per-pixel basis. The output of a DVS
is a stream of events, each reflecting the fact that the intensity of the point with spatial
coordinates (x, y) has increased or decreased at time t. Once the magnitude of the
intensity change exceeds some threshold, a positive or negative event is generated.
Pixel intensity changes are recorded 10³ times per second, generating an asynchronous
event stream at microsecond time resolution. Visually, this event stream can be
presented as a point cloud (with (x, y, t) 3D coordinates and the colour corresponding
to the sign of the intensity change). Mathematically, an event is a tuple (x, y, t, p),
where x and y are the coordinates of the pixel taking integer values in the ranges
[0, W] and [0, H] (with W × H being the spatial resolution of the sensor), t is a
timestamp measured in microseconds, and p ∈ {1, 0} is the polarity of the event
corresponding to increasing (p = 1) or decreasing (p = 0) pixel intensity. If the sensor
is static, events are generated only by moving objects or when the lighting changes.
Since natural light usually changes slowly and smoothly, and artificial lighting turns
on and off relatively seldom, the majority of events appear due to real motion, and
the background points generate events rarely. This allows not storing unnecessary
information and reduces the time required for data transfer.
Besides this, due to its high temporal resolution, the sensor can capture very fast
motion that looks blurred in the frames obtained by conventional cameras. In addition
to temporal sensitivity, it is extremely sensitive to intensity and can capture very small
changes invisible to digital cameras. The threshold for event generation depends on
the exact device, but it is always very low, so the sensor reacts to small brightness
changes. Thus, it can capture information even under very bad lighting, when a
conventional camera outputs almost black frames.
One more advantage is the very low power consumption relative to RGB sensors.
DVS need less energy and hence do not require powerful batteries, which makes them
lighter and easier to use. Due to all these features, dynamic vision sensors have become
widespread in many fields. For example, they are currently being adopted in home
assistance (“smart home”) systems, where energy saving is very relevant.
5.2 Data visualization
Since the aim of the research is to recognize a person by their walking manner,
we need to know the mutual arrangement of points and their relative changes rather
than discrete events at distinct points. Hence, a visualization of the event stream is
required in order to deal with it similarly to conventional video frames. The
reconstruction of the initial scene from the stream is a challenging problem, and at
the time of the study there were no high-quality reconstruction methods. Nevertheless,
the stream itself contains a lot of information that is enough for many computer vision
applications. Thus, it is proposed to build a simple visualization that carries
information about the content of the video while being easy to compute.
Several ways of transforming the DVS signal into video have been proposed by
different researchers. The one selected in this investigation seems the most natural
and is often used for event stream visualization (see, e.g., [77]). To construct this
visualization, a temporal window of a certain length is considered for each point, and
the sum of all the events in this time interval is calculated for the point (see Fig. 5.1).
The length of the window is usually set to 0,04 seconds to get a 25 fps frame rate.
The polarities of the events are normalized prior to summation to get ±1 values; thus,
negative and positive events that occurred very close to each other (during one time
interval) can cancel each other out, preventing the appearance of noise. With such an
aggregation, a sequence of frames is obtained with zero values in the areas where
nothing happened, and the more changes occurred at a point, the larger the magnitude
of the corresponding value. Thus, these frames can be interpreted as a measure of
activity at each point.
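For illustration, a minimal NumPy sketch of this aggregation is given below, assuming the events are already loaded as integer arrays of x, y, t (in microseconds) and polarity; the function name and the 0,04 s window default simply mirror the description above.

```python
import numpy as np

def events_to_frames(x, y, t, p, width, height, window_us=40_000):
    """Aggregate a DVS event stream into frames.

    x, y: integer pixel coordinates; t: timestamps in microseconds;
    p: polarities in {0, 1}. Returns an array (n_frames, height, width)
    where each pixel holds the sum of +/-1 polarities within its window.
    """
    signs = np.where(p > 0, 1, -1)                  # normalize polarity to +/-1
    frame_idx = ((t - t.min()) // window_us).astype(int)
    n_frames = frame_idx.max() + 1
    frames = np.zeros((n_frames, height, width), dtype=np.float32)
    # accumulate events; positive and negative events within one window cancel out
    np.add.at(frames, (frame_idx, y, x), signs)
    return frames
```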
Figure 5.1 — The transformation of the event stream into a sequence of frames
However, although the events obtained by the sensor are generated by changes
in the scene and thus correlate with the motion of the points, they do not always
follow the motion (because lighting is still one of the causes of event generation), and
even when they do, they do not show where a point has moved. This distinguishes the
event stream from the optical flow, which reflects exactly the apparent spatial
movements of the points between frames. Thus, the event stream visualizations do not
provide information about motion directly, but they are supposed to contain enough
data for recognition.
It is worth noting that although these constructed images are not real frames
and many appearance characteristics are lost, the moving human figure is visually
recognizable in them, and this figure is the only moving object in the frame due to the
static background and the chosen aggregation method. An example of a visualized
frame is shown on the left side of Fig. 5.2. Having these event images, one can apply
various image-based methods of detection, pose estimation, and gait recognition to
them and check the transferability of these methods to event streams.
Figure 5.2 — Examples of an event-based image (left) and the blob detection result
(right)
5.3 Recognition algorithm
In this section, the details of the event stream recognition algorithm are
presented. Although the aggregation transforms the stream into a sequence of images
that can be processed similarly to conventional video frames, these images have a
different meaning, and the direct application of the developed methods is not always
possible. Thus, to transfer the algorithm to such data, some modifications have to be
implemented. The whole pipeline of the algorithm is shown in Figure 5.3 and almost
replicates the conventional pose-based algorithm. The following subsections focus on
the distinct steps of the algorithm that are modified or need some explanation due to
the change of source data, namely
1. human figure detection;
2. pose estimation;
3. optical flow estimation.
The rest of the operations are implemented as explained in section 3.2.
Figure 5.3 — The pipeline of the event-based gait recognition algorithm
5.3.1 Human figure detection
The first thing that should be done to recognize a subject is to detect it; thus,
the detection algorithm for event-based data is the topic of this first subsection. Since
the sensors are usually applied to video surveillance (for example, in home assistance
systems), they are fixed, leading to a static background in the captured data. In this
case, the points of the background do not generate events, and most of the events
appear because of the moving objects, namely the people that have to be identified.
Although the human figure in the event-based images can be noticed visually
and even looks quite similar to the figure in a grayscale image, most of the modern
complex neural detection methods applied “out-of-the-box” get confused by such
inputs and make mistakes. Nevertheless, due to the motionless background, the human
figure is the only object leading to event generation, so “semantic” detection is not
needed and can be reduced to the separation of a non-uniform human figure from the
almost uniform background. Thereby, a basic blob model was constructed: it considers
the average relative deviation of pixel brightness in each row and column of the
visualized stream from the median value and detects where these deviations exceed
some threshold. Due to the lack of noise in the images, the threshold is very low and
is set to 0.4–0.45%. The rows and columns with the first and the last spikes of
brightness are assumed to be the borders of the human figure bounding box. Such a
heuristic approach is very quick, not requiring any training, and works well when all
of the person's body parts move; but if some part of the body stops moving for a
moment (which periodically happens to hands or feet), it becomes invisible and the
figure can be detected incompletely. So, to be sure that the boxes cover the whole
figure, we widen them by 10 percent on each side. An example of the detection is shown
on the right side of Fig. 5.2. One can see that the legs are not easily separable from
the background, so the widening of the box is really needed. However, such
approximate bounding boxes are indeed easy to compute and sufficient for further
human pose estimation.
and enough for further human pose estimation.
Depending on whether only full height figure or several body parts are used in
the model these blob detections can be used as final boxes or as auxiliary data for
pose estimation. It will be discussed in the next subsection that in the latter case
the influence of detector mistakes on the final recognition decreases.
5.3.2 Pose estimation
According to the results of the study described in chapter 3, a more precise
consideration of body parts increases the identification accuracy compared to full-body
classification. Thus, it is assumed that such an approach can improve the recognition
quality for event images as well. Similarly to human figure detection, the
state-of-the-art methods trained for pose estimation in conventional RGB images do
not cope with event-based ones. Hence, the existing models have to be modified to be
applicable to the event-based data. The Stacked Hourglass Network [110] was chosen,
since it is well-trainable and has an adequate number of parameters, which makes it
easy and quick to apply.
The Stacked Hourglass Network is a neural model consisting of several
“hourglass” residual modules that capture information across different scales. This
model does not solve the regression problem directly and approximates not the exact
locations of human joints but the heatmaps of their positions, considering a Gaussian
distribution centered at the real joint coordinate as the target. The chosen architecture
containing two stacked hourglasses was initially trained on the MPII Human Pose [111]
dataset and achieves 86,95% of correct key points (PCKh).
Despite being very successful, the Stacked Hourglass Network has one
disadvantage compared to other state-of-the-art approaches: it processes only one
person per frame and requires an approximate location of this person as input.
However, since human figure detection turns out to be quite a simple procedure that
does not require much time or computational power, it is used for the preliminary
analysis, and the obtained bounding boxes of the figures (their centre and side) are
fed into the pose estimation model along with the frames themselves. The main
requirement for the boxes is that they should contain the target human figure and not
contain any other figures. Since these conditions are usually satisfied, the detector
allows estimating the pose accurately enough for the subsequent search for body parts.
Nevertheless, even knowing the approximate human body bounding box, this
network does not find the correct locations of the joints in event-based images and
thus could not be used directly for body part detection. To fine-tune the model, we
have simulated event-based images for the CASIA [9] dataset and annotated them
automatically using the OpenPose library [89]. The OpenPose method is actually more
general than the Stacked Hourglass Network, being able to estimate the poses of
several people simultaneously. It does not require any bounding boxes but finds all
the figures in a frame and, for each person, returns the coordinates of the joint
locations and their confidences. Nevertheless, this algorithm is more complex to train,
so only its inference procedure was performed to get the automatic annotation, and
the simpler Hourglass model was taken for training the event-based pose estimator.
The details of the simulation algorithm are described in section 5.4.
Because of the relatively small number of subjects and the large number of
conditions, the CASIA database is very challenging for gait recognition. Nevertheless,
since it contains thousands of frames in which a full-height human figure is depicted,
once labeled this dataset is perfect for the purpose of walking pose estimation.
Annotating the CASIA database allowed obtaining a large DVS dataset labeled with
pose key points that can be used for network training.
Figure 5.4 — Examples of the skeletons estimated by fine-tuned Stacked Hourglass
model
The exact quality of the estimator has not been evaluated, as it is just an
auxiliary problem, but visually the results are close to the ground truth. Examples of
several skeletons estimated for event-based images are shown in Fig. 5.4. The first
two images correspond to almost perfect estimation: all the key points are located
correctly. The two other examples reflect the most common mistakes that occur during
pose estimation: leg invisibility, leading to a “one-legged” skeleton, and a “noisy
thorax”, when specifically this key point turns out to be misdetected.
Similarly to the baseline conventional model, the skeletons are not fed directly
into the recognition model but are used to compute the “areas of interest” in which to
consider the optical flow more precisely. Thus, the thorax key point can be discarded,
since its natural location is always strictly inside the box containing the head, hands,
and hips, so this point should not affect the boundaries of the body part bounding
boxes. This heuristic helps to get rid of the “noisy thorax” problem. The “one-leg”
problem is not as easy to solve as the previous one. First of all, this problem is much
more common: while on the ground, the supporting leg does not move, so no events
are generated in this area for a while, and the corresponding leg becomes invisible, as
shown in the third image of Fig. 5.4. And the legs not only influence the boundaries
of the boxes but completely define some of them. For these reasons, a reduced set is
considered containing only three big parts corresponding to the upper body, the lower
body, and the full body. The full five-element set and the full-body model are
considered as well, to replicate the original model and compare the results. It is worth
noting that despite the presence of the human figure bounding boxes obtained from
the detector, the full-body boxes in the pose-based model are computed anew
according to the joint locations. This allows refining the detection and cutting off the
excess background that could be caught because of noise in the event-based images.
Fig. 5.5 shows the difference between the initial bounding box and the joint-based one.
Figure 5.5 — Human figure bounding boxes after blob detection (left) and after
refinement by pose estimation (right)
5.3.3 Optical flow estimation
Since we transfer the developed OF-based method to event-based sequences, the
main source of information is the optical flow in the surroundings of the human figure.
Although the visualizations constructed from the event stream are not real
reconstructions, they do have a natural meaning. As mentioned in section 5.2, all the
events that occurred at each spatial position during some time interval are aggregated
during visualization; thus, these frames can be interpreted as a measure of activity at
each point.
Besides this, in the absence of overlapping, it can be assumed that nothing moves
inside the bounding box except the human figure itself, so it is supposed to be the only
moving object in the scene, and the event aggregation effectively measures the presence
of the person at each pixel. This makes the event frames similar to silhouettes, which
also reflect the presence of a human body at each pixel. It was shown in the previous
study (see chapter 3) that optical flow computed from binary silhouettes contains
enough information for identification; therefore, the OF based on event frames is
supposed to be recognizable as well.
As in all the developed methods, the optical flow is computed by the Farnebäck
algorithm [87], but unlike the previous approaches, the third channel is not added to
the horizontal and vertical components of the optical flow, and the maps are not stored
as images. In order to save all the information about pixel movements without any
compression or discretization, the raw optical flow is stored in a binary format and
normalized prior to cropping and inputting into the network. Fig. 5.6 shows the
horizontal and vertical components of such maps after normalization.
Figure 5.6 — Two components of event-based optical flow maps
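As a rough illustration, the following OpenCV-based sketch computes such dense flow between two consecutive event frames and returns the two raw components; the particular Farnebäck parameters and the scaling of the event frames to 8-bit grayscale are assumptions, not values taken from the thesis.

```python
import cv2
import numpy as np

def event_flow(prev_frame, next_frame):
    """Dense Farneback optical flow between two aggregated event frames.

    The frames are scaled to 8-bit grayscale; the returned array has shape
    (H, W, 2) with the raw horizontal and vertical components.
    """
    def to_gray(f):
        f = np.abs(f)
        return np.uint8(255 * f / (f.max() + 1e-6))

    flow = cv2.calcOpticalFlowFarneback(
        to_gray(prev_frame), to_gray(next_frame), None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow

# the raw maps can be kept in a binary format (e.g. .npy) and normalized later:
# np.save("flow_0001.npy", event_flow(f0, f1))
```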
5.4 Event-based data
To evaluate the proposed algorithm modification, a gait collection captured by a
DVS is required. Similarly to classical gait recognition, the lack of data is the most
challenging problem. Moreover, unlike the conventional case, where some datasets
exist and can be used, there are no publicly available event data appropriate for the
gait recognition task; thus, the event stream has to be simulated. Several ways of DVS
data generation have been proposed by now, but depending on the purpose of
simulation, some of them are not applicable to the gait recognition task. The most
common simulation approach is generating a short event stream by moving the sensor
around some static scene. However, the gait sequences need to be captured by a static
sensor, hence this approach is not appropriate. Besides this, since gait is a very specific
motion, it is hardly generable from scratch, so initialization is required; thus, it is
proposed to simulate the streams based on the existing gait collections.
The data obtained by a DVS has two main features: each event represents the
fact of a brightness increase or decrease, but not the value of the change, and events
are generated asynchronously with a very high frequency (several thousand times per
second). Tracking the brightness differences between consecutive frames is not a
challenging task, since one just needs to compare the values of every pixel in each pair
of frames. The main problem is the frequency of events, as the frame rate of all the
videos in gait datasets is 25–30 fps, which is several orders of magnitude less than the
rate at which the sensor generates events.
In conventional low-frequency video sequences, the image changes a lot between
consecutive frames, so the events cannot be generated correctly in a direct way; to
obtain realistic events, intermediate frames have to be approximated. A simple linear
approximation proposed in [112] is selected for our purpose. First, the coloured RGB
frames are transformed to grayscale according to the formula
Y = 0.299R + 0.587G + 0.114B, and then the intermediate intensity of each pixel is
interpolated linearly between consecutive frames. Each time the intensity changes by
more than some value C, a new event is generated in the corresponding pixel. Despite
the simplicity and naivety of this approach, the resulting event stream turns out to be
rather close to the real data, and the images constructed by aggregation look quite natural.
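Below is a minimal NumPy sketch of this simulation under the stated linear interpolation; the function name, the per-frame bookkeeping of the last reference intensity, and the default value of the threshold C are illustrative assumptions. The grayscale frames are assumed to be obtained beforehand with the conversion formula above.

```python
import numpy as np

def simulate_events(frames, timestamps_us, C=10.0):
    """Naive DVS simulation from a grayscale video of shape (T, H, W).

    For every pixel, the intensity is linearly interpolated between consecutive
    frames, and an event is emitted each time it changes by more than C.
    Returns arrays x, y, t (microseconds), p (1 positive / 0 negative).
    """
    xs, ys, ts, ps = [], [], [], []
    ref = frames[0].astype(np.float32)        # intensity at which the last event fired
    for k in range(1, len(frames)):
        cur = frames[k].astype(np.float32)
        diff = cur - ref
        n_events = np.floor(np.abs(diff) / C).astype(int)
        yy, xx = np.nonzero(n_events)
        for y, x in zip(yy, xx):
            n = n_events[y, x]
            sign = 1 if diff[y, x] > 0 else 0
            # spread event timestamps uniformly inside the inter-frame interval
            step = (timestamps_us[k] - timestamps_us[k - 1]) / (n + 1)
            for j in range(1, n + 1):
                xs.append(x); ys.append(y)
                ts.append(int(timestamps_us[k - 1] + j * step))
                ps.append(sign)
            # move the reference level only by the emitted events
            ref[y, x] += (1 if sign else -1) * n * C
    return np.array(xs), np.array(ys), np.array(ts), np.array(ps)
```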
To assess the simulation visually, several real recordings were made with a DVS,
and the corresponding frame sequences were constructed and compared. Figure 5.7
shows two event-based images. The simulated data (on the right) has somewhat more
noise around the human figure, but this noise is hardly noticeable, and all the details
of the figure are even clearer than in the real frame (on the left). The visual comparison
allows assuming that the simulation is close enough to a really captured stream and
that the experiments can be conducted on such data.
Two gait datasets were transformed into event streams for training and
evaluating the model. The TUM-GAID database was chosen as the main dataset for
the experiments due to its simplicity: the videos from this collection embody the
perfect situation, when a person is visible from the side and there is no overlapping.
Figure 5.7 — The visualizations of the real (left) and simulated (right) event streams
Nevertheless, considering that this study was, at its time, the first to address the DVS
gait recognition problem, such data could assess the workability and efficiency of the
proposed model prior to applying it to more complex real data.
The second database, CASIA-B, was used for pose estimator training. While very
challenging for gait recognition due to the small number of subjects, this collection
contains 110 video sequences for each person and is a perfect set for pose key point
detection. CASIA was also transformed into event-based form, leading to about 1.7M
labeled frames applicable for pose estimation. To accelerate the training process and
make the data less complex, the dataset was filtered, and only videos captured under
viewing angles from 54 to 144 degrees were taken, as in the case of steeper angles some
body parts can get out of the frame. Hence, about 770 000 annotated frames were used
for fine-tuning the pre-trained model.
5.5 Experimental evaluation
The experiments conducted in this study aim to check whether a person can be
recognized by gait in the event stream and how accurate such recognition is. In the
experiments, the detection methods and the influence of different body parts on the
recognition were compared. The evaluation protocol and metrics replicate the ones
described in chapter 3 in order to be able to compare the RGB and DVS models.
First, the following basic model was examined. The classification network was
trained to predict the person from the OF patches containing the full-height figure.
The bounding boxes to be cropped were found by the basic blob detector, and the pose
estimation model was not used at all. The pipeline of this model is presented at the
top of Fig. 5.3 (yellow and green blocks). The accuracy of this model is shown in the
first row of Table 5.1. Since this study is the first to recognize gait from event-based
data, there were no state-of-the-art methods to compare the results with. However,
regardless of any other approaches, the recognizability of the data itself can be
investigated and a comparison with conventional video can be made. For this reason,
the quality of the original method is presented in the last row of the table. The accuracy
of the event-based model is about 2% lower than the conventional one, which
demonstrates that tracking the events of brightness change instead of observing the
total scene loses almost no information about the motion and, thus, does not
deteriorate the recognition. A probable reason for this result is the lack of noise due
to the dynamic nature of the data: the sensor reacts only to the changes that exceed
some threshold, ignoring too small spikes, and such “data cleansing” counterbalances
the loss of effective information caused by considering events instead of real pixel
brightness values.
After the basic model was constructed, it was modified by adding pose estimation
prior to OF patch cropping. This advanced model is shown in the yellow and blue
blocks of Fig. 5.3. Due to the “one-leg” problem and the unclear usefulness of the
optical flow around the legs, two sets of body parts were considered: all five parts and
the three big parts of the body. The second and the third rows of Table 5.1 present
the comparison of these models with the corresponding approach based on RGB
videos. The model based on three body parts achieves higher accuracy than the
five-part one, presumably because of the redundant noise in the surroundings of the
legs and probable misdetections of invisible legs in many frames. Note that the best
accuracy, obtained with the three big body parts, is only 0,8% less than the quality of
the original model.
Table 5.1 — The quality of the end-to-end model on simulated TUM dataset

Body part set                       Rank-1 [%]   Rank-5 [%]
Basic full body                        98,0         99,7
Pose-based, three parts                99,0        100,0
Pose-based, five parts                 98,8         99,8
Original RGB pose-based model          99,8        100,0
Besides this, the influence of the detection quality on the whole model was
evaluated. The simplest moving-object detector was used, and its mistakes could
worsen the final result; thus, we conducted the same experiments but using more
accurate bounding boxes computed from the initial RGB videos by a background
subtraction procedure. All the other steps of the algorithm remain the same. Due to
the static background, the result of the subtraction is the silhouette mask of the moving
person, which can easily be bounded, so such bounding boxes are supposed to be really
close to the ground truth. Using these boxes, the effectiveness of the optical flow itself
is evaluated without the influence of misdetections. The results of such a “pure”
evaluation of the basic model and the best pose-based one are presented in Table 5.2.
Although the accuracy of both models increased a bit, the results remain
Table 5.2 — The quality of the models with perfect detection on simulated TUM dataset

Body part set                       Rank-1 [%]   Rank-5 [%]
Basic full body                        98,3         99,6
Pose-based, three parts                99,1         99,8
Original RGB pose-based model          99,8        100,0
quite close to those of the end-to-end model. This shows that probable detector
mistakes do not decrease the quality greatly.
In addition to the traditional classification problem, the verification task was
considered to complete the investigation. For each pair of probe and gallery videos,
the classification problem is stated: whether they belong to the same person or not.
The Euclidean distance between the neural descriptors is taken as the metric, and the
classifier was constructed according to all possible thresholds in order to get a ROC
curve and calculate the ROC AUC and EER (see Fig. 5.8 and Table 5.3). The difference
between the curves is hardly noticeable; thus, an auxiliary large-scale figure is added
to show that the curve corresponding to the three-part DVS model is closer to the
upper-left corner of the plot than all the others.
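For reference, a small scikit-learn-based sketch of this verification evaluation is given below; the exhaustive pair construction and the use of sklearn.metrics.roc_curve are assumptions about the implementation, not details taken from the thesis.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def verification_metrics(descriptors, labels):
    """ROC AUC and EER for the pairwise verification task.

    descriptors: (N, d) array of gait features; labels: (N,) subject IDs.
    The similarity score of a pair is the negated Euclidean distance.
    """
    scores, targets = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            scores.append(-np.linalg.norm(descriptors[i] - descriptors[j]))
            targets.append(int(labels[i] == labels[j]))
    scores, targets = np.array(scores), np.array(targets)

    fpr, tpr, _ = roc_curve(targets, scores)
    # EER: the point where the false positive rate equals the false negative rate
    eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]
    return roc_auc_score(targets, scores), eer
```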
The numerical results provided in Table 5.3 also show that the event-based model
surpasses the original RGB-based one in the verification task. Although the difference
is quite small, both metrics are better for the event-based model than for the
conventional one: the AUC is higher and the EER is lower.
Figure 5.8 — The ROC curves corresponding to the three considered models and the
original RGB model: fully depicted (left) and at a large scale (right) to show the
difference
Table 5.3 — Results of verification on TUM-GAID dataset

Body part set                       ROC AUC    EER [%]
Basic full body                      0,9985      1,89
Pose-based, three parts              0,9994      1,10
Pose-based, five parts               0,9992      1,34
Original RGB pose-based model        0,9975      1,26
5.6 Conclusions
The research described in this chapter has verified the recognizability of the
event streams obtained by dynamic vision sensors. Despite the very simple and rough
way of visualizing the data, the event-based frames contain enough information for
human recognition. Due to the fact that the gait recognition algorithm developed in
this work, and particularly in chapter 3, considers the motion of the points rather than
the appearance or any colour characteristics, it can easily be transferred to event data
containing information about the dynamics in the scene. Besides this, while solving
the gait recognition problem, the object detection and pose estimation problems were
also considered. The visualization method selected in this study makes the event flow
visually similar to grayscale images and semantically close to silhouette masks, which
makes the application of conventional methods natural.
The results of the conducted experiments show that, when applied to videos
containing a single moving person, modern computer vision methods achieve very high
quality, close to colour-video-based models and even superior to them in the
verification task. Hence, the event-based data is not only presumably similar to the
conventional one, but this closeness is confirmed experimentally. Such results
encourage applying conventional methods to DVS data and continuing to develop them.
Conclusion
In this thesis, a novel gait recognition approach is presented. Due to the
variability of influential factors and the absence of one general database covering all
possible conditions, the Author has selected some of the factors to overcome and
proposed a method stable to their changes.
The main contributions of this thesis are as follows.
1. A side-view gait recognition method which analyses the point translations
between consecutive video frames is proposed and implemented. The method
shows state-of-the-art quality on the side-view gait recognition benchmark.
2. The first multi-view gait recognition method based on the consideration of the
movements of points in different areas of the human body is proposed and
implemented. State-of-the-art recognition accuracy is achieved for certain
viewing angles, and the best approaches at the time of the investigation are
outperformed in verification mode.
3. The influence of the point movements in different body parts on the
recognition quality is revealed.
4. An original study of gait recognition algorithm transfer between different data
collections is carried out.
5. Two original approaches for improving view resistance are proposed and
implemented. Both methods increase the cross-view recognition quality and
complement each other when applied simultaneously. The model obtained
using these approaches outperforms the state-of-the-art methods on the
multi-view gait recognition benchmark.
6. The first method for human recognition by motion in event-based data from
a dynamic vision sensor is proposed and implemented. Quality close to
conventional video recognition is obtained.
Further development of this research is possible in the following directions:
1. Development and implementation of view estimation methods for gait videos;
2. Integration of view information into a recognition model for improving
identification quality and increasing view resistance;
3. Development and implementation of multi-view RGB and event-based gait
video synthesis by applying motion capture methods and generating 3D data,
to enlarge the existing training and testing datasets;
4. Application of synthetic data for increasing multi-view recognition quality in
different types of data (RGB, event streams).
References
1. Maslow, A. Motivation and Personality / A. Maslow. –– Oxford, 1954. ––
411 p.
2. Cutting, J. E. Recognizing friends by their walk: Gait perception without familiarity cues / J. E. Cutting, L. T. Kozlowski // Bulletin of the Psychonomic Society. –– 1977. –– Vol. 9, no. 5. –– P. 353––356.
3. Murray, M. P. GAIT AS A TOTAL PATTERN OF MOVEMENT /
M. P. Murray // American Journal of Physical Medicine & Rehabilitation. ––
1967. –– Vol. 46.
4. Lichtsteiner, P. A 128× 128 120 dB 15 µs Latency Asynchronous Temporal
Contrast Vision Sensor / P. Lichtsteiner, C. Posch, T. Delbruck // IEEE
Journal of Solid-State Circuits. –– 2008. –– Vol. 43, no. 2. –– P. 566––576.
5. 4.1 A 640×480 dynamic vision sensor with a 9µm pixel and 300Meps address-event representation / B. Son [et al.] // 2017 IEEE International Solid-State Circuits Conference (ISSCC). –– 02/2017. –– P. 66––67.
6. A 240 × 180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision
Sensor / C. Brandli [et al.] // IEEE Journal of Solid-State Circuits. –– 2014. ––
Oct. –– Vol. 49, no. 10. –– P. 2333––2341.
7. The TUM Gait from Audio, Image and Depth (GAID) database: Multimodal
recognition of subjects and traits. / M. Hofmann [et al.] // J. of Visual Com.
and Image Repres. –– 2014. –– Vol. 25(1). –– P. 195––206.
8. The OU-ISIR Gait Database Comprising the Large Population Dataset and
Performance Evaluation of Gait Recognition / H. Iwama [et al.] // IEEE
Trans. on Information Forensics and Security. –– 2012. –– Oct. –– Vol. 7,
Issue 5. –– P. 1511––1521.
9. Yu, S. A Framework for Evaluating the Effect of View Angle, Clothing and
Carrying Condition on Gait Recognition. / S. Yu, D. Tan, T. Tan // Proc.
of the 18’th ICPR. Vol. 4. –– 2006. –– P. 441––444.
10. Hofmann, M. Gait Recognition in the Presence of Occlusion: A New Dataset and Baseline Algorithms / M. Hofmann, S. Sural, G. Rigoll // 19th International Conferences on Computer Graphics, Visualization and Computer Vision (WSCG). –– Plzen, Czech Republic, 01/2011.
11. Efficient Night Gait Recognition Based on Template Matching / Daoliang Tan
[et al.] // 18th International Conference on Pattern Recognition (ICPR’06).
Vol. 3. –– 08/2006. –– P. 1000––1003.
12. Gross, R. The CMU Motion of Body (MoBo) Database : tech. rep. / R. Gross,
J. Shi ; Carnegie Mellon University. –– Pittsburgh, PA, 06/2001.
13. Gait Recognition under Speed Transition / A. Mansur [et al.] //. –– 06/2014.
14. The OU-ISIR Gait Database Comprising the Treadmill Dataset / Y. Makihara
[et al.] // IPSJ Trans. on Computer Vision and Applications. –– 2012. ––
Apr. –– Vol. 4. –– P. 53––62.
15. The OU-ISIR Gait Database Comprising the Large Population Dataset with
Age and Performance Evaluation of Age Estimation / C. Xu [et al.] // IPSJ
Trans. on Computer Vision and Applications. –– 2017. –– Vol. 9, no. 24. ––
P. 1––14.
16. Lee, L. Gait Analysis for Recognition and Classification / L. Lee,
W. E. L. Grimson // Proc IEEE Int Conf Face Gesture Recognit. ––
06/2002. –– P. 148––155.
17. The HumanID gait challenge problem: data sets, performance, and analysis /
S. Sarkar [et al.] // IEEE Transactions on Pattern Analysis and Machine
Intelligence. –– 2005. –– Feb. –– Vol. 27, no. 2. –– P. 162––177.
18. Shutler, J. On a Large Sequence-Based Human Gait Database / J. Shutler, M. Grant, J. Carter // Applications and Science in Soft Computing. –– 2004. –– Jan.
19. Silhouette Analysis-Based Gait Recognition for Human Identification / L. Wang [et al.] // Pattern Analysis and Machine Intelligence, IEEE Transactions on. –– 2004. –– Jan. –– Vol. 25. –– P. 1505––1518.
20. The OU-ISIR Large Population Gait Database with real-life carried object
and its performance evaluation / M. Z. Uddin [et al.] // IPSJ Transactions
on Computer Vision and Applications. –– 2018. –– May. –– Vol. 10, no. 1. ––
P. 5.
21. Nixon, M. S. Human identification based on gait / M. S. Nixon, T. Tan,
R. Chellappa //. Vol. 4. –– Springer Science & Business Media, 2010.
22. The effect of time on gait recognition performance / D. Matovski [et al.] //
International Conference on Pattern Recognition (ICPR). Vol. 7. –– 04/2012.
23. The AVA Multi-View Dataset for Gait Recognition / D. López-Fernández
[et al.] // Activity Monitoring by Multiple Distributed Sensing. –– Springer
International Publishing, 2014. –– P. 26––39. –– (Lecture Notes in Computer
Science).
24. Multi-view large population gait dataset and its performance evaluation for
cross-view gait recognition / N. Takemura [et al.] // IPSJ Trans. on Computer
Vision and Applications. –– 2018. –– Vol. 10, no. 4. –– P. 1––14.
25. Han, J. Individual Recognition Using Gait Energy Image / J. Han,
B. Bhanu // IEEE Trans. Pattern Anal. Mach. Intell. –– Washington,
DC, USA, 2006. –– Vol. 28, no. 2. –– P. 316––322.
26. Bashir, K. Gait recognition using gait entropy image / K. Bashir, T. Xiang,
G. S // Proceedings of 3rd international conference on crime detection and
prevention. –– 2009. –– P. 1––6.
27. Chen, J. Average Gait Differential Image Based Human Recognition / J. Chen, J. Liu // Scientific World Journal. –– 2014.
28. Chrono-Gait Image: A Novel Temporal Template for Gait Recognition /
C. Wang [et al.] // Computer Vision – ECCV 2010. –– Berlin, Heidelberg :
Springer Berlin Heidelberg, 2010. –– P. 257––270.
29. Lam, T. H. Gait flow image: A silhouette-based gait representation for human
identification / T. H. Lam, K. Cheung, J. N. Liu // Pattern Recognition. ––
2011. –– Vol. 44, no. 4. –– P. 973––987.
30. Effective Part-Based Gait Identification using Frequency-Domain Gait Entropy Features / M. Rokanujjaman [et al.] // Multimedia Tools and Applications. Vol. 74. –– 11/2013.
31. Gait Recognition Using a View Transformation Model in the Frequency Domain / Y. Makihara [et al.] // Computer Vision – ECCV 2006 / ed. by A. Leonardis, H. Bischof, A. Pinz. –– Berlin, Heidelberg : Springer Berlin Heidelberg, 2006. –– P. 151––163.
32. Dalal, N. Histograms of Oriented Gradients for Human Detection / N. Dalal, B. Triggs // Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Vol. 1. –– Washington, DC, USA : IEEE Computer Society, 2005. –– P. 886––893. –– (CVPR ’05).
33. Multiple HOG templates for gait recognition. / Y. Liu [et al.] // Proceedings
of the 21st International Conference on Pattern Recognition (ICPR2012). ––
2012. –– P. 2930––2933.
34. Learning Realistic Human Actions from Movies / I. Laptev [et al.] // CVPR
2008 - IEEE Conference on Computer Vision & Pattern Recognition. ––
Anchorage, United States : IEEE Computer Society, 06/2008. –– P. 1––8.
35. Yang, Y. Gait Recognition Using Flow Histogram Energy Image / Y. Yang, D. Tu, G. Li // 2014 22nd International Conference on Pattern Recognition. –– 2014. –– P. 444––449.
36. Cross-view gait recognition using joint Bayesian / C. Li [et al.] // Proc. SPIE
10420, Ninth International Conference on Digital Image Processing (ICDIP
2017). –– 2017.
37. Complete canonical correlation analysis with application to multi-view gait
recognition / X. Xing [et al.] // Pattern Recognition. –– 2016. –– Vol. 50. ––
P. 107––117.
38. Belhumeur, P. N. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection / P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman // IEEE Trans. Pattern Anal. Mach. Intell. –– 1997. –– July. –– Vol. 19, no. 7. –– P. 711––720.
39. Bashir, K. Gait recognition without subject cooperation / K. Bashir, T. Xiang, S. Gong // Pattern Recognition Letters. –– 2010. –– Vol. 31. –– P. 2052––2060.
40. Cross-view gait recognition using view-dependent discriminative analysis /
A. Mansur [et al.] // IEEE International Joint Conference on Biometrics. ––
2014. –– P. 1––8.
41. Joint Intensity and Spatial Metric Learning for Robust Gait Recognition / Y. Makihara [et al.] // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). –– 07/2017. –– P. 6786––6796.
42. View-Invariant Discriminative Projection for Multi-View Gait-Based Human
Identification / M. Hu [et al.] // Information Forensics and Security, IEEE
Transactions on. –– 2013. –– Dec. –– Vol. 8. –– P. 2034––2045.
43. Muramatsu, D. View Transformation Model Incorporating Quality Measures
for Cross-View Gait Recognition / D. Muramatsu, Y. Makihara, Y. Yagi //
IEEE transactions on cybernetics. –– 2015. –– July. –– Vol. 46.
44. Muramatsu, D. Cross-view gait recognition by fusion of multiple transformation consistency measures / D. Muramatsu, Y. Makihara, Y. Yagi // IET Biometrics. –– 2015. –– June. –– Vol. 4.
45. Arseev, S. Human Recognition by Appearance and Gait / S. Arseev,
A. Konushin // Programming and Computer Software. –– 2018. ––
P. 258––265.
46. Yoo, J.-H. Automated Markerless Analysis of Human Gait Motion for Recognition and Classification / J.-H. Yoo // ETRI Journal. –– 2011. –– Apr. –– Vol. 33. –– P. 259––266.
47. Whytock, T. Dynamic Distance-Based Shape Features for Gait Recognition / T. Whytock, A. Belyaev, N. M. Robertson // Journal of Mathematical Imaging and Vision. –– 2014. –– Nov. –– Vol. 50, no. 3. –– P. 314––326.
48. Fusion of spatial-temporal and kinematic features for gait recognition with
deterministic learning / M. Deng [et al.] // Pattern Recognition. –– 2017. ––
Vol. 67. –– P. 186––200.
49. Castro, F. Pyramidal Fisher Motion for multiview gait recognition. / F. Castro, M. Marín-Jiménez, R. Medina-Carnicer // 22nd International Conference on Pattern Recognition. –– 2014. –– P. 1692––1697.
50. Simonyan, K. Two-stream convolutional networks for action recognition in
videos. / K. Simonyan, A. Zisserman // NIPS. –– 2014. –– P. 568––576.
51. Beyond short snippets: Deep networks for video classification. / J. Ng [et al.] // CVPR. –– 2015.
52. Feichtenhofer, C. Convolutional two-stream network fusion for video action
recognition. / C. Feichtenhofer, A. Pinz, A. Zisserman // CVPR. –– 2016.
53. Varol, G. Long-term Temporal Convolutions for Action Recognition /
G. Varol, I. Laptev, C. Schmid // IEEE Transactions on Pattern Analysis
and Machine Intelligence. –– 2017.
54. Wang, L. Action recognition with trajectory-pooled deep-convolutional descriptors. / L. Wang, Y. Qiao, X. Tang // CVPR. –– 2015. –– P. 4305––4314.
55. DeepGait: A Learning Deep Convolutional Representation for Gait Recognition / X. Zhang [et al.] // Biometric Recognition. –– Cham : Springer International Publishing, 2017. –– P. 447––456.
56. DeepGait: A Learning Deep Convolutional Representation for View-Invariant
Gait Recognition Using Joint Bayesian / C. Li [et al.] // Applied Sciences. ––
2017. –– Feb. –– Vol. 7. –– P. 15.
57. VGR-Net: A View Invariant Gait Recognition Network / D. Thapar [et al.] // IEEE 4th International Conference on Identity, Security, and Behavior Analysis (ISBA). –– 10/2017.
58. Wolf, T. Multi-view gait recognition using 3D convolutional neural networks /
T. Wolf, M. Babaee, G. Rigoll // 2016 IEEE International Conference on
Image Processing (ICIP). –– 09/2016. –– P. 4165––4169.
59. GEINet: View-invariant gait recognition using a convolutional neural network / K. Shiraga [et al.] // 2016 International Conference on Biometrics (ICB). –– 2016. –– P. 1––8.
60. A Comprehensive Study on Cross-View Gait Based Human Identification with
Deep CNNs / Z. Wu [et al.] // IEEE Transactions on Pattern Analysis and
Machine Intelligence. –– 2016. –– Mar. –– Vol. 39.
61. On Input/Output Architectures for Convolutional Neural Network-Based
Cross-View Gait Recognition / N. Takemura [et al.] // IEEE Transactions
on Circuits and Systems for Video Technology. –– 2019. –– Sept. –– Vol. 29,
no. 9. –– P. 2708––2719.
62. Siamese neural network based gait recognition for human identification /
C. Zhang [et al.] // IEEE International Conference on Acoustics, Speech and
Signal Processing. –– 2016. –– P. 2832––2836.
63. GaitSet: Regarding Gait as a Set for Cross-View Gait Recognition / H. Chao
[et al.] // AAAI. –– 2019.
64. Invariant feature extraction for gait recognition using only one uniform
model / S. Yu [et al.] // Neurocomputing. –– 2017. –– Vol. 239. –– P. 81––93.
65. GaitGAN: Invariant Gait Feature Extraction Using Generative Adversarial
Networks / S. Yu [et al.] // 2017 IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW). –– 2017. –– P. 532––539.
66. Learning view invariant gait features with Two-Stream GAN / Y. Wang
[et al.] // Neurocomputing. –– 2019. –– Vol. 339. –– P. 245––254.
67. Hochreiter, S. Long Short-Term Memory / S. Hochreiter, J. Schmidhuber //
Neural Comput. –– 1997. –– Nov. –– Vol. 9, no. 8. –– P. 1735––1780.
68. Gait Identification by Joint Spatial-Temporal Feature / S. Tong [et al.] //
Biometric Recognition. –– 2017. –– P. 457––465.
69. Feng, Y. Learning effective gait features using LSTM / Y. Feng, Y. Li,
J. Luo // International Conference on Pattern Recognition. –– 2016. ––
P. 325––330.
70. Memory-based Gait Recognition / D. Liu [et al.] // Proceedings of the
British Machine Vision Conference (BMVC). –– BMVA Press, 09/2016. ––
P. 82.1––82.12.
71. Pose-Based Temporal-Spatial Network (PTSN) for Gait Recognition with Carrying and Clothing Variations / R. Liao [et al.] // Biometric Recognition. –– Cham : Springer International Publishing, 2017. –– P. 474––483.
72. Battistone, F. TGLSTM: A time based graph deep learning approach to gait
recognition / F. Battistone, A. Petrosino // Pattern Recognition Letters. ––
2019. –– Vol. 126. –– P. 132––138. –– Robustness, Security and Regulation
Aspects in Current Biometric Systems.
73. Event-Based Moving Object Detection and Tracking / A. Mitrokhin [et al.] //
2018 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS). –– 10/2018. –– P. 1––9.
74. Spatiotemporal multiple persons tracking using Dynamic Vision Sensor / E. Piątkowska [et al.] // 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. –– 06/2012. –– P. 35––40.
75. Yuan, W. Fast localization and tracking using event sensors / W. Yuan, S. Ramalingam // IEEE International Conference on Robotics and Automation (ICRA). –– 2016. –– P. 4564––4571.
76. Neuromorphic Vision Sensing for CNN-based Action Recognition / A. Chadha
[et al.] // ICASSP 2019 - 2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). –– 05/2019. –– P. 7968––7972.
77. Dynamic Vision Sensors for Human Activity Recognition / S. A. Baby
[et al.] // 2017 4th IAPR Asian Conference on Pattern Recognition
(ACPR). –– 11/2017. –– P. 316––321.
78. Bay, H. SURF: Speeded Up Robust Features / H. Bay, T. Tuytelaars,
L. Van Gool // Computer Vision – ECCV 2006. –– 2006. –– P. 404––417.
79. Event-driven body motion analysis for real-time gesture recognition / B. Kohn [et al.] // 2012 IEEE International Symposium on Circuits and Systems (ISCAS). –– 05/2012. –– P. 703––706.
80. A Low Power, Fully Event-Based Gesture Recognition System / A. Amir
[et al.] // 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). –– 07/2017. –– P. 7388––7397.
81. Asynchronous frameless event-based optical flow / R. Benosman [et al.] //
Neural Networks. –– 2012. –– Vol. 27. –– P. 32––37.
82. Zhu, A. EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras / A. Zhu [et al.] // Proceedings of Robotics: Science and Systems. –– Pittsburgh, Pennsylvania, 06/2018.
83. Wang, X. Intelligent multi-camera video surveillance: A review / X. Wang //
Pattern Recognition Letters. –– 2013. –– Vol. 34, no. 1. –– P. 3––19. ––
Extracting Semantics from Multi-Spectrum Video.
84. Large-scale video classification with convolutional neural networks. / A. Karpathy [et al.] // CVPR. –– 2014.
85. Burton, A. Thinking in Perspective: Critical Essays in the Study of Thought
Processes / A. Burton, J. Radford. –– Methuen, 1978. –– 232 p.
86. Warren, D. Electronic Spatial Sensing for the Blind: Contributions from Perception, Rehabilitation, and Computer Vision / D. Warren, E. R. Strelow. –– Netherlands, 1985. –– 521 p.
87. Farneback, G. Two-frame motion estimation based on polynomial expansion. / G. Farneback // Proc. of Scandinavian Conf. on Image Analysis. Vol. 2749. –– 2003. –– P. 363––370.
88. OpenCV (Open Source Computer Vision Library). –– URL: https://opencv.org.
89. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields / Z. Cao
[et al.] // CVPR. –– 2017.
90. Automatic Learning of Gait Signatures for People Identification / F. M. Castro [et al.] // Advances in Computational Intelligence. –– Cham : Springer International Publishing, 2017. –– P. 257––270.
91. Return of the devil in the details: Delving deep into convolutional nets. /
K. Chatfield [et al.] // Proc. BMVC. –– 2014.
92. Ioffe, S. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. / S. Ioffe, C. Szegedy // ICML. –– 2015.
93. Krizhevsky, A. ImageNet Classification with Deep Convolutional Neural Networks / A. Krizhevsky, I. Sutskever, G. Hinton // Neural Information Processing Systems. –– 2012. –– Jan. –– Vol. 25.
94. Dropout: A Simple Way to Prevent Neural Networks from Overfitting /
N. Srivastava [et al.] // Journal of Machine Learning Research. –– 2014. ––
Vol. 15. –– P. 1929––1958.
95. Chen, T. Net2Net: Accelerating learning via knowledge transfer / T. Chen, I. Goodfellow, J. Shlens // ICLR. –– 2016.
96. Glorot, X. Understanding the difficulty of training deep feedforward neural networks / X. Glorot, Y. Bengio // Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Vol. 9 / ed. by Y. W. Teh, M. Titterington. –– Chia Laguna Resort, Sardinia, Italy : PMLR, 05/2010. –– P. 249––256. –– (Proceedings of Machine Learning Research).
97. Simonyan, K. Very deep convolutional networks for large-scale image recognition. / K. Simonyan, A. Zisserman // ICLR. –– 2015.
98. Triplet probabilistic embedding for face verification and clustering /
S. Sankaranarayanan [et al.] // 2016 IEEE 8th International Conference
on Biometrics Theory, Applications and Systems (BTAS). –– 2016. ––
P. 1––8.
99. Small medical encyclopedia //. Vol. 4 / ed. by V. Pokrovskiy. –– 1996. ––
P. 570.
100. Convolutional pose machines / S.-E. Wei [et al.] // CVPR. –– 2016.
101. Deep Residual Learning for Image Recognition / K. He [et al.] // 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). –– 2016. –– P. 770––778.
102. Zagoruyko, S. Wide Residual Networks / S. Zagoruyko, N. Komodakis //
Proceedings of the British Machine Vision Conference (BMVC) / ed. by
E. R. H. Richard C. Wilson, W. A. P. Smith. –– BMVA Press, 09/2016.
103. Lasagne: First release. / S. Dieleman [et al.]. –– 08/2015. –– URL: http://dx.doi.org/10.5281/zenodo.27878.
104. Castro, F. M. Multimodal Features Fusion for Gait, Gender and Shoes Recognition / F. M. Castro, M. J. Marín-Jiménez, N. Guil // Mach. Vision Appl. –– 2016. –– Nov. –– Vol. 27, no. 8. –– P. 1213––1228.
105. Deep multi-task learning for gait-based biometrics / M. Marín-Jiménez
[et al.] // IEEE International Conference on Image Processing (ICIP). ––
09/2017.
106. Guan, Y. A robust speed-invariant gait recognition system for walker and
runner identification / Y. Guan, C.-T. Li. –– 2013. –– Jan.
107. Cross-view gait recognition using view-dependent discriminative analysis / A. Mansur [et al.] // IJCB 2014 - 2014 IEEE/IAPR International Joint Conference on Biometrics. –– 12/2014.
108. PyTorch: An Imperative Style, High-Performance Deep Learning Library /
A. Paszke [et al.] // Advances in Neural Information Processing Systems 32. ––
Curran Associates, Inc., 2019. –– P. 8024––8035.
109. Lichtsteiner, P. A 128×128 120 dB 15µs Latency Asynchronous Temporal
Contrast Vision Sensor / P. Lichtsteiner, C. Posch, T. Delbruck // IEEE
Journal of Solid-State Circuits. –– 2008. –– Vol. 43, no. 2. –– P. 566––576.
110. Newell, A. Stacked Hourglass Networks for Human Pose Estimation /
A. Newell, K. Yang, J. Deng // ECCV. –– 2016.
111. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis /
M. Andriluka [et al.] // CVPR. –– 2014. –– P. 3686––3693.
112. The event-camera dataset and simulator: Event-based data for pose estima­
tion, visual odometry, and SLAM / E. Mueggler [et al.] // The International
Journal of Robotics Research. –– 2017. –– Vol. 36, no. 2. –– P. 142––149.
List of Figures
0.1 Informal problem statement: having a video with a walking person one needs to determine this person’s identity from the database . . . 6
1.1 Examples of factors affecting the gait itself: using a supporting cane (a), wearing high heels (b) and carrying conditions: bag (c), backpack (d) or other loads (e) . . . 13
1.2 Examples of factors affecting the gait representation: wearing an overall (a), long dress (b) or coat (c), occlusions (d), and bad lighting and low quality of the frame (e) . . . 13
1.3 Examples of frames from TUM-GAID database . . . 14
1.4 Examples of frames from CASIA Gait Dataset B . . . 15
1.5 Examples of frames from OU-ISIR Large Population Dataset . . . 15
1.6 Examples of basic gait descriptors . . . 20
2.1 The same frame with different brightness, contrast and saturation settings . . . 27
2.2 Optical flow as a vector field of motion between consecutive frames . . . 27
2.3 General algorithm pipeline . . . 28
2.4 The visualizations of OF components and the constructed image . . . 29
2.5 Input block. The square box cropped from 10 consecutive frames of horizontal (the first and the second rows) and vertical (the third and the fourth rows) components of normalized OF maps . . . 31
2.6 The scheme of CNN-M fusion architecture . . . 36
2.7 The pipeline of the algorithm based on blocks of optical flow maps . . . 38
2.8 Side-view frames from TUM and CASIA gait databases . . . 43
3.1 Gait-independent motions of hands and head that are not informative . . . 46
3.2 The visualization of optical flow between two silhouette masks (two channels are the components of OF and the third one is set to zero) . . . 47
3.3 The pipeline of the pose-based gait recognition algorithm . . . 48
3.4 The bounding boxes for human body parts: right and left feet (small blue boxes at the bottom), upper and lower body (green boxes in the middle) and full body (big red box) . . . 49
3.5 The visualizations of optical flow maps in five areas of the human figure . . . 50
3.6 The frame with human figure bounding box (red) and the border of probable crops (green) depicted and the examples of patches that can be cropped from this one frame . . . 51
3.7 Construction of residual blocks in Wide ResNet architecture . . . 53
3.8 ROC (the first row) and CMC (the second row) curves under 85° gallery view and 55°, 65°, and 75° probe view, respectively . . . 61
4.1 The example of the same walk captured under different viewing angles . . . 66
4.2 The desired cluster structure of the data . . . 67
4.3 The pipeline of gait recognition algorithm with two modifications (view loss and CV TPE) shown in green boxes . . . 68
4.4 The curve of the training loss during three steps of training process . . . 70
4.5 The desired mutual arrangement of the elements in a triplet . . . 71
5.1 The transformation of the event stream into a sequence of frames . . . 78
5.2 The examples of event-based image (left) and blob detection result (right) . . . 79
5.3 The pipeline of the event-based gait recognition algorithm . . . 79
5.4 Examples of the skeletons estimated by fine-tuned Stacked Hourglass model . . . 82
5.5 Human figure bounding boxes after blob detection (left) and after refinement by pose estimation (right) . . . 83
5.6 Two components of event-based optical flow maps . . . 84
5.7 The visualizations of the real and simulated event streams . . . 86
5.8 The ROC curves corresponding to three considered models and the original RGB model: fully depicted (left) and large-scale (right) to show the difference . . . 89
List of Tables
1.1 Gait recognition datasets . . . 18
2.1 The baseline architecture . . . 34
2.2 The VGG-like architecture . . . 36
2.3 Results on TUM-GAID dataset . . . 41
2.4 Quality of recognition on TUM-GAID dataset under different conditions . . . 42
2.5 The quality of transfer learning . . . 43
2.6 The quality of the jointly learned model . . . 44
3.1 Wide Residual Network architecture . . . 52
3.2 Recognition quality on TUM-GAID dataset . . . 56
3.3 Recognition quality on TUM-GAID dataset under different conditions . . . 57
3.4 Recognition accuracy on CASIA lateral part . . . 57
3.5 Recognition quality on CASIA lateral part under different conditions . . . 58
3.6 Average recognition rates for three angles on CASIA database . . . 59
3.7 Comparison of Rank-1 and EER on OULP dataset . . . 60
3.8 Comparison of Rank-1 on OULP dataset following five-fold CV protocol . . . 61
3.9 Recognition quality of models based on different sets of body parts on TUM-GAID dataset . . . 62
3.10 Recognition quality for different video lengths on TUM-GAID dataset . . . 63
3.11 Recognition quality for different video lengths on CASIA-B dataset . . . 64
3.12 The quality of transfer learning . . . 64
4.1 Comparison of average cross-view recognition accuracy of proposed approaches and state-of-the-art models . . . 75
5.1 The quality of the end-to-end model on simulated TUM dataset . . . 87
5.2 The quality of the perfect detected model on simulated TUM dataset . . . 88
5.3 Results of verification on TUM-GAID dataset . . . 89