National Research University Higher School of Economics

As a manuscript

Sokolova Anna

Neural network model for human recognition by gait in different types of video

PhD Dissertation for the purpose of obtaining academic degree Doctor of Philosophy in Computer Science

Academic Supervisor: PhD, Associate Professor Anton S. Konushin

Moscow — 2020

Contents

Introduction
1. Background
   1.1 Influential factors
   1.2 Gait recognition datasets
   1.3 Literature review
       1.3.1 Manual feature construction
       1.3.2 Automatic feature training
       1.3.3 Event-based methods
2. Baseline model
   2.1 Motivation
   2.2 Proposed algorithm
       2.2.1 Data preprocessing
       2.2.2 Neural network backbone
       2.2.3 Feature postprocessing and classification
   2.3 Experimental evaluation
       2.3.1 Evaluation metrics
       2.3.2 Experiments and results
   2.4 Conclusions
3. Pose-based gait recognition
   3.1 Motivation
   3.2 Proposed algorithm
       3.2.1 Body part selection and data preprocessing
       3.2.2 Neural network pipeline
       3.2.3 Feature aggregation and classification
   3.3 Experimental evaluation
       3.3.1 Performance evaluation
       3.3.2 Experiments and results
   3.4 Conclusions
4. View resistant gait recognition
   4.1 Motivation
   4.2 View Loss
   4.3 Cross-view triplet probability embedding
   4.4 Experimental evaluation
   4.5 Conclusions
5. Event-based gait recognition
   5.1 Dynamic Vision Sensors
   5.2 Data visualization
   5.3 Recognition algorithm
       5.3.1 Human figure detection
       5.3.2 Pose estimation
       5.3.3 Optical flow estimation
   5.4 Event-based data
   5.5 Experimental evaluation
   5.6 Conclusions
Conclusion
References
List of Figures
List of Tables

Introduction

According to Maslow's studies [1], the need for safety is one of the basic and fundamental human needs. People tend to protect themselves, guard their homes against illegal intrusion, and keep their property from theft. With the development of modern video surveillance systems, it has become possible to capture everything that happens in a certain area and then analyze the obtained data. Using video recordings, one can track the movements of people, detect illegal entry into private territory, identify criminals captured by the cameras, and control access to restricted objects. For example, video surveillance systems help to catch burglars, robbers, or arsonists, automatically count the number of people in a line or in a crowd, and analyze the character of their movements, reducing the amount of subjective human intervention and decreasing the time required for data processing. Besides this, being embedded in the now widely used home assistance systems ("smart home"), they can distinguish family members and change behavior depending on the person. For example, such a system can be configured to perform different actions if it captures a child or an elderly person.

Dissertation topic and its relevance

Recently, the problem of recognizing a person in a video (see Fig. 0.1) has become particularly relevant. A person is identifiable in a video by several criteria, the most accurate of which is facial features. However, the current recognition quality allows decision-making to be entrusted to a machine only in a cooperative mode, when a person's face is compared with a high-quality photograph, such as the one in a passport. In real life (especially when crimes are committed), a person's face may be hidden or poorly visible due to a bad viewing angle, insufficient lighting, or the presence of headgear, a mask, makeup, etc. In this case, another characteristic is required for recognition, and gait is a possible one. According to biometric and physiological studies [2; 3], the manner in which a person walks is individual and cannot be falsified, which makes gait a unique identifier comparable to fingerprints or the iris of the eye. In addition, unlike these "classic" methods of identification, gait can be observed at a great distance, it does not require high video resolution, and, most importantly, no direct cooperation of the subject is needed; thus, a person may not even know that he is being captured and analyzed.
Thus, in some cases gait serves as the only possible cue for identifying a person in video surveillance data.

Figure 0.1 — Informal problem statement: given a video of a walking person, one needs to determine this person's identity from the database

The gait recognition problem is very specific due to the presence of many factors that change the gait visually (different shoes; carried objects; clothes that hide parts of the body or constrain movements) or affect the internal representation of the gait model (viewing angle, various camera settings). For this reason, the quality and reliability of identification by gait is much lower than by face, and, despite the success of modern computer vision methods, the problem has not been solved yet. Many methods are applicable solely to the conditions present in the databases on which they are trained, which limits their usability in real life.

In addition to classic surveillance cameras that record everything happening in the frame 25 to 30 times per second, other types of sensors, in particular Dynamic Vision Sensors (DVS) [4—6], have been gaining popularity in recent years. Unlike conventional video cameras, such a sensor, like the retina, captures changes of intensity in each pixel, ignoring points with constant brightness. With a static sensor, events at background points are generated very rarely, preventing the storage of redundant information. At the same time, the intensity at each point is measured several thousand times per second, which leads to asynchronous capture of all important changes. As a result, such a stream of events turns out to be very informative and suitable for obtaining the data necessary for solving many video analysis tasks that require the extraction of dynamic characteristics, including gait recognition. Dynamic vision sensors are now a promising, rapidly developing technology, which creates the need to solve video analysis tasks on the data they produce. Despite the constant development of computer vision methods, no approaches to gait recognition from dynamic vision sensor data had been proposed before this study, and this area represents a vast field for research.

Deep learning methods based on neural network training have become the most successful at solving computer vision problems in recent years. Features learned by neural networks often have a higher level of abstraction, which is necessary for high-quality recognition. This makes it possible to achieve outstanding results in problems such as video and image classification, image segmentation, object detection, visual tracking, etc. However, despite the success of deep learning methods, classical computer vision methods are still ahead of neural networks on some gait recognition benchmarks, and neither approach has yet achieved accuracy acceptable for full integration.

Goals and objectives of the research

This research aims to develop and implement a neural network algorithm for human identification in video based on the motion of the points of the human figure that is robust to viewing angle changes and to different clothing and carrying conditions. To achieve this goal, the following tasks are set:
1. Develop and implement an algorithm for human identification in video that analyses the optical flow.
2. Develop and implement a multi-view algorithm for gait recognition based on the analysis of the motion of different human body parts.
3. Develop an algorithm for human recognition in event-based data from Dynamic Vision Sensors.

Formal problem statement

The formal research objects are video surveillance frame sequences $v$ and event streams $e$ from dynamic vision sensors in which a moving person is captured. Given a labelled gallery $D$, one needs to determine which person from the gallery appears in the video, i.e., the identity of the person in the video is to be defined. Let the gallery be defined as

$$D = \{(v_i, x_i)\}_{i=1}^{N}, \quad x_i \in P,$$

where $N$ is the number of sequences and $P$ is the set of subjects. The label of the subject $x \in P$ is to be found for the video $v$ under investigation. The goal is to construct some similarity measure $S$ according to which the closest object will be searched for in the gallery:

$$S(v, v_i) \to \min_{v_i:\ \exists x_i:\ (v_i, x_i) \in D}.$$

The problem for the event streams is stated similarly.

A set of restrictions is imposed on all the sequences:
– each video in the gallery contains exactly one person;
– each person is captured full length;
– no occlusions;
– static camera;
– the set of possible camera settings (height and tilt) is limited.

The described conditions are introduced due to the limitations of the existing datasets and benchmarks.
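In practice, identification under this statement reduces to a nearest-neighbour search over the gallery. A minimal sketch, assuming gait descriptors have already been extracted from the sequences (the descriptor extraction is the subject of the following chapters, and the Euclidean distance standing in for $S$ is an assumption here):

```python
import numpy as np

def identify(query_descriptor, gallery):
    """Nearest-neighbour identification over a labelled gallery.

    gallery: list of (descriptor, subject_label) pairs playing the role of D;
    the Euclidean distance plays the role of the similarity measure S.
    """
    distances = [np.linalg.norm(query_descriptor - d) for d, _ in gallery]
    return gallery[int(np.argmin(distances))][1]
```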
Novelty and summary of the author's main results

In this thesis, the author introduces an original method for human recognition by gait that is robust to view changes, to shortening of the video sequences, and to dataset transfer. The following is the list of the main research results; the corresponding publications are listed in the Publications section below.
1. A side-view gait recognition method that analyses point translations between consecutive video frames is proposed and implemented. The method shows state-of-the-art quality on the side-view gait recognition benchmark.
2. A multi-view gait recognition method based on the movements of points in different areas of the human body is proposed and implemented. State-of-the-art recognition accuracy is achieved for certain viewing angles, and the best approaches available at the time of the study are outperformed in verification mode.
3. The influence of point movements in different body parts on the recognition quality is revealed.
4. A gait recognition method robust to dataset transfer is proposed and implemented.
5. Two approaches for improving view resistance are proposed and implemented. Both methods increase the cross-view recognition quality and complement each other when applied simultaneously. The model obtained using these approaches outperforms the state-of-the-art methods on a multi-view gait recognition benchmark.
6. A method for human recognition by motion in event-based data from a dynamic vision sensor is proposed and implemented. Quality close to that of conventional video recognition is obtained.

The described results are original and obtained for the first time. Below, the author's contributions are summarized in four main points.
1. The first gait recognition method based on the investigation of point movements in different parts of the body is proposed and implemented.
2. Two original methods for improving view resistance are proposed. In these approaches, an auxiliary regularization of the model is performed, and the descriptors are projected into a special feature space that decreases the view dependency.
3. An original study of gait recognition algorithm transfer between different data collections is carried out.
4. The first method for human recognition by motion in event-based data from a dynamic vision sensor is proposed and implemented.

Practical significance

Gait recognition is an applied problem of computer vision. All the suggested methods and approaches are motivated by natural and mathematical considerations and aim to be practically applicable. Thus, once implemented, the proposed human identification methods can be integrated into different automation systems. For example, the developed approach can be used in home assistance systems ("smart home") which recognize family members and change their behaviour depending on the captured person. Combined with an alarm, such a system can respond to the appearance of people outside the family and track illegal entry into private houses. Besides this, the gait identification algorithm can be used in crowded places, such as train stations and airports, where it is not possible to take close-up shots but there is an obvious need to track people and control access.

Publications and approbation of the research

Main results of this thesis are published in the following papers. The PhD candidate is the main author of all these articles.

First-tier publications:
– A. Sokolova, A. Konushin. Pose-based deep gait recognition // IET Biometrics, 2019 (Scopus, Q2).

Second-tier publications:
– A. Sokolova, A. Konushin. Gait recognition based on convolutional neural networks // International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 2017 (Scopus).
– A. Sokolova, A. Konushin. Methods of gait recognition in video // Programming and Computer Software, 2019 (WoS, Scopus, Q3).
– A. Sokolova, A. Konushin. View Resistant Gait Recognition // ACM International Conference Proceeding Series, 2019 (Scopus).

The results of this thesis have been reported at the following conferences and workshops:
– ISPRS International workshop "Photogrammetric and computer vision techniques for video surveillance, biometrics and biomedicine" (PSBB17), Moscow, Russia, May 15–17, 2017. Talk: Gait recognition based on convolutional neural networks.
– Computer Vision and Deep Learning summit "Machines Can See", Moscow, Russia, June 9, 2017. Poster: Gait recognition based on convolutional neural networks.
– 28th International Conference on Computer Graphics and Vision "GraphiCon 2018", Tomsk, Russia, September 24–27, 2018. Talk: Review of video gait recognition methods.
– Samsung Biometric Workshop, Moscow, Russia, April 11, 2019. Talk: Human identification by gait in RGB and event-based data.
– 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, May 27–31, 2019. Poster: Human identification by gait from event-based camera.
– 3rd International Conference on Video and Image Processing (ICVIP), Shanghai, China, December 20–23, 2019. Talk: View Resistant Gait Recognition (best presentation award).

Thesis outline. The thesis consists of the introduction, the background chapter, four chapters corresponding to the developed approaches, and the conclusion.

1. Background

In this chapter, the current state of affairs in the gait recognition field is described. Prior to the description of existing methods, the datasets available for gait recognition research are listed. There exists a large number of different gait collections; however, they hardly cover all the conditions that can influence the gait representation.
The large variability of affecting factors is the main problem of the gait recognition challenge; thus, we first describe these factors, then list the databases that include the variability of some of them, and finish the chapter with an overview of existing approaches.

1.1 Influential factors

As in any computer vision problem, in gait recognition there are many factors that do not change the content of the video but change its appearance and, therefore, influence the inner representation of the gait. All these factors can be divided into two groups: those affecting the gait itself and those affecting only its representation. The factors from the first group actually change the manner of walking, making recognition more complex for both humans and computers. The second set contains conditions that change only the inner representation while the gait remains the same: it can be obvious to a human that the same person appears in two videos, but for the computer they are two completely different video sequences with different features.

Let us start with the first group, illustrated in Fig. 1.1. These factors include external conditions, such as footwear (which can be tight or uncomfortable, or have high heels) that changes the gait, or carrying conditions (bags, backpacks, or any heavy load) that occupy the hands and make the walk heavier. Besides this, there are inner conditions changing the gait, for example, illness (injury, lameness, fatigue, etc.) or drunkenness. One more inner condition is a long time interval between recordings: even if nothing notable has happened since the first recording, the gait can change a bit after a long time has passed.

Figure 1.1 — Examples of factors affecting the gait itself: using a supporting cane (a), wearing high heels (b), and carrying conditions: bag (c), backpack (d), or other loads (e)

The second set of influential factors, illustrated in Fig. 1.2, does not change the walking manner but changes its representation. These factors are more insidious, since they can be imperceptible to the human eye but significant for the machine. They include view variation, which is the most challenging problem for gait recognition; different clothing that can hide some parts of the body and change the shape of the human figure (for example, long dresses, coats, or puffer jackets); various occlusions; and changes of lighting and camera settings. A human may not even notice the difference, but it is a big challenge to train a model robust to all these variations. All the described factors can be combined, making the recognition problem even more complex (see Fig. 1.1 (b) with a long dress and heels, Fig. 1.2 (d) with dresses and occlusions or (e) with a coat and carrying conditions, and all the images in Figs. 1.1 and 1.2 having different views).

Figure 1.2 — Examples of factors affecting the gait representation: wearing an overall (a), long dress (b) or coat (c), occlusions (d), and bad lighting and low frame quality (e)

Due to such great variability of influential conditions, large general datasets are hardly collectable. Different researchers pay attention to different factors while ignoring others, which leads to overfitting of the models to the exact conditions presented in the database. The next section describes the datasets available at the time of the study and the conditions presented in them. One can see that even the largest databases do not cover all the conditions, leaving a large field for further research.
1.2 Gait recognition datasets

At the time of the study, the most widely used complex datasets for gait recognition are "TUM Gait from Audio, Image and Depth" (TUM-GAID) [7], the OU-ISIR Large Population Dataset (OULP) [8], and CASIA Gait Dataset B [9]. These databases are not the only existing collections for gait recognition, but they are the most popular due to their size and condition variability; hence, they are the datasets used in this study. In this section, these collections are described, as well as other gait recognition databases.

The first and simplest database, TUM-GAID, is used for side-view recognition (see Fig. 1.3). It contains videos of 305 subjects walking from left to right and from right to left, captured from the lateral view. The frame rate of the videos is about 30 fps, and the resolution of each frame is 320 × 240. TUM-GAID contains videos with different walking conditions: 6 normal walks, 2 walks carrying a backpack, and 2 walks wearing coating shoes for each person. Although there are recordings from only one viewpoint, the number of subjects allows around half of the people to be taken as the training set and the other half for evaluation. The authors of the database have provided a split into training, validation, and testing sets, so that the models can be trained using the data of 150 subjects and tested with the other 155. Hence, no randomness influences the results. Being full-color, this collection is applicable in a large number of experiments with different approaches. In addition, this database contains videos taken with a six-month interval, which makes it possible to check the robustness of the algorithms to temporal gait changes.

Figure 1.3 — Examples of frames from the TUM-GAID database (normal walk, normal walk, coating shoes, backpack)

The second dataset, CASIA B, is relatively small in terms of the number of subjects, but it has a large variation of views: 124 subjects are captured from 11 different viewing angles (from 0° to 180° with a step of 18°). The videos last 3 to 5 seconds, the frame rate is about 25 fps, and the resolution equals 320 × 240 pixels. All the trajectories are straight and recorded indoors in the same room for all people. Besides the variety of viewpoints, similarly to the first collection, different clothing and carrying conditions are presented: wearing a coat and carrying a bag. In total, 10 video sequences are available for each person captured from each view: 6 normal walks without extra conditions, two walks in outdoor clothing, and two walks with a bag. The CASIA dataset is very popular due to its view variability. Nevertheless, it contains too few subjects to train a deep neural network to recognize gait from all views without overfitting. Examples of video frames from the CASIA B collection are shown in Fig. 1.4.

Figure 1.4 — Examples of frames from CASIA Gait Dataset B (normal walk, normal walk, outerwear, bag)

The third one, the OULP set, consists of video sequences of more than 4,000 people taken with two cameras, with the shooting angle varying smoothly from 55° to 85°. The data from this collection is distributed in the form of silhouette masks; therefore, not all the described methods are applicable to this database. The silhouettes are binarized and normalized to size 88 × 128, as shown in Fig. 1.5. Nevertheless, since many of the developed gait recognition methods are evaluated on this database, it is used for comparison of the approaches as well.
Figure 1.5 — Examples of frames from the OU-ISIR Large Population Dataset

There also exist many other gait databases collected for experiments in different conditions, which have to be mentioned for completeness. At the Technical University of Munich, not only the TUM-GAID dataset was collected, but also another one, TUM-IITKGP [10], in collaboration with the Indian Institute of Technology Kharagpur. This database is rather small, containing 35 subjects, but the walks are captured in 6 different conditions including dynamic and static occlusions (with objects that move or do not move, respectively). Thus, TUM-IITKGP addresses the occlusion challenge, which complicates recognition but frequently occurs in real life.

One more factor varying in some of the databases is walking speed. For example, CASIA Gait Database C [11] and the CMU Motion of Body dataset (CMU MoBo) [12] contain normal, slow, and fast walks, and the OU-ISIR Gait Speed Transition Dataset (OUST) [13] consists of videos of walks where the speed gradually changes: accelerating from 1 km/h to 5 km/h and decelerating from 5 km/h to 1 km/h. Another database collected at Osaka University, OU-ISIR Treadmill (OUTM) dataset A [14], contains walks made on a treadmill with speed varying from 1 km/h to 10 km/h. Similarly to OULP [8], all the OU-ISIR collections are distributed in the form of binary masks, which restricts their usage for experiments.

Osaka University researchers have also collected an age-varying dataset, the OU-ISIR Gait Database, Large Population Dataset with Age (OULP + Age) [15], where the age of the subjects ranges from 2 to 90 years. Investigations show that the walking manner can change depending on a person's age; thus, it is worth verifying whether this correlation really exists.

To examine whether the gait can change over a long period of time, time-varying datasets have been collected. Similarly to TUM-GAID, where 32 subjects are captured with a six-month time interval, the MIT AI collection [16] and the USF HumanID dataset [17] contain different types of variations including the time interval between recordings. The latter contains 1870 sequences obtained from 122 subjects with 32 condition variations; the database is thus very large, and it is distributed only on a physical USB drive, which makes its distribution a long and complicated process.

It is worth noting one more variation included in some of the datasets: terrain variation. Although the background is usually static in gait recognition collections, the environment can influence the representation; thus, different terrains can prevent overfitting to the dataset. The datasets containing such variations are the USF HumanID dataset and the SOTON Large database [18].

The collections described above are not the only available datasets; many other databases [19—23] address the problems of multi-view recognition, clothing variation, and carrying conditions. Most of them are quite small, which prevents their usage for neural network training; others (such as the OU-ISIR Multi-View Large Population Dataset (OU MVLP) [24]) were collected recently, after the proposed approach was developed, so they have not been used in the experiments either. Nevertheless, these databases influence the development of gait recognition methods, so they are worth mentioning. Table 1.1 shows some details of the video datasets available for gait recognition.

Table 1.1 — Gait recognition datasets (TM denotes treadmill recordings)

| Database | Subjects | Sequences | Environmental settings | Variations |
| --- | --- | --- | --- | --- |
| TUM-GAID [7] | 305 | 3370 | Indoor | Directions (2), load (2), time (2), footwear (2), clothing (2) |
| TUM-IITKGP [10] | 35 | 840 | Indoor | Directions (2), load (2), occlusion |
| CASIA A [19] | 20 | 240 | Outdoor | Views (3) |
| CASIA B [9] | 124 | 13640 | Indoor | Views (11), clothing (2), load (2) |
| CASIA C [11] | 153 | 1530 | Outdoor, thermal camera | Speed (3), load (2) |
| OUTM A [14] | 34 | 612 | TM, indoor | Speed (9) |
| OUTM B [14] | 68 | 2176 | TM, indoor | Clothing (32) |
| OUTM C [14] | 200 | 5000 | TM, indoor | Views (25) |
| OUTM D [14] | 185 | 370 | TM, indoor | Gait fluctuations |
| OULP [8] | 4007 | 7842 | Indoor | Views (4) |
| OUST [13] | 179 | n/a | Indoor, TM/static floor | Speed shift |
| OULP + Age [15] | 63,846 | 63,846 | Indoor | Age (from 2 to 90) |
| OU MVLP [24] | 10,307 | n/a | Indoor | Views (14) |
| AVAMVG [23] | 20 | 1200 | Indoor | Directions (2), views (2) |
| CMU MoBo [12] | 25 | 600 | TM, indoor | Views (6), speed (2), inclination (2), load (2) |
| USF [17] | 122 | 1870 | Outdoor | Views (2), terrain (2), footwear (2), load (2), time (2) |
| MIT AI [16] | 24 | 194 | Indoor | Views (4), time |
| SOTON Large [18] | 114 | 2128 | Indoor/outdoor, TM/static floor | Views (2), terrain (3) |
| SOTON Temporal [22] | 25 | 26160 | Indoor | Views (12), clothing (2), time (5), footwear (2) |
1.3 Literature review

Similarly to most computer vision problems, there are currently two approaches to obtaining gait features from the source data: hand-crafted construction of the descriptors and their automatic learning. The first approach is more traditional and is usually based on the calculation of various properties of binary silhouette masks, or on the study of joint positions, their relative distances and speeds, and other kinematic parameters. Feature learning is usually performed by artificial neural networks, which have become very popular in recent years due to outstanding results in many computer vision problems, such as video and image classification, image segmentation, object detection, visual tracking, and others. Features that are learned by neural networks often have a higher level of abstraction, which is necessary for high-quality recognition. The best identification quality is achieved by methods combining these two approaches: initially, the basic characteristics of the gait are computed manually, and then they are fed to a neural network to extract more abstract features. However, despite the success of deep methods, there are non-deep methods outperforming the neural networks on some of the benchmarks; thus, both global approaches are worthy of attention.

1.3.1 Manual feature construction

Let us first discuss some classical basic approaches in which the gait descriptors are constructed manually based on natural considerations.

Binary silhouettes

The most common gait characteristic is the gait energy image (GEI) [25], the average over one gait cycle of the binary silhouette masks of a moving person. Such a spatial-temporal description of human gait is computed under the assumption that the movements of the body are repeated periodically during walking. The resulting images characterize the frequency of a moving person being in a particular position. This approach is widely used, and many other gait recognition methods are based on it. Besides this, many approaches that do not use gait energy images directly suggest a similar aggregation of other basic features.
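For illustration, the GEI and its entropy-based variant discussed next reduce to simple per-pixel statistics of aligned silhouette masks. A minimal sketch, assuming the masks are already segmented, cropped, and size-normalized:

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI: per-pixel average of binary silhouette masks over one gait cycle.

    silhouettes: array of shape (T, H, W) with values in {0, 1}.
    """
    return silhouettes.astype(np.float32).mean(axis=0)

def gait_entropy_image(silhouettes, eps=1e-8):
    """GEnI: per-pixel Shannon entropy of the silhouette probability."""
    p = gait_energy_image(silhouettes)
    return -p * np.log2(p + eps) - (1.0 - p) * np.log2(1.0 - p + eps)
```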
For example, recognition can be based on gait entropy images (GEnI) [26], where the entropy of each pixel is calculated instead of silhouette averaging, or on the frame difference energy image (FDEI) [27], reflecting the differences between silhouettes in consecutive frames of the video. Visualizations of a binary silhouette mask and the images of gait energy and entropy are shown in Fig. 1.6. The problem with all these methods averaging frame features is the loss of temporal information, since the order of the frames is not taken into account. Several approaches [28—30] aim at preventing this loss. Thus, Chrono-Gait Images (CGI) [28] encode the temporal information by extracting the contours from the silhouettes and weighting them according to the position of the corresponding frame in a gait cycle. Another gait descriptor, the Gait Flow Image (GFI) [29], is based on the binarized optical flow between consecutive silhouette images and reflects the step-by-step changes of the silhouettes. One more silhouette-based gait representation technique preserving temporal information is frequency-domain gait entropy (EnDFT) [30]. This approach combines GEnI with the discrete Fourier transform (DFT) [31], in which the mean of binary silhouette masks with exponentially decreasing weights is calculated; in EnDFT, the entropy is computed from the DFT instead of the GEI. Despite the comparable simplicity and naturalness of all these methods, it is the GEI that has been used and developed the most so far.

Similarly to ordinary black-and-white images, GEIs can be used for further feature calculation, such as histograms of oriented gradients (HOG descriptors) [32; 33] or histograms of optical flow (HOF descriptors) [34; 35], or for constructing more complex classification algorithms that use the specifics of the gait recognition problem.

Figure 1.6 — Examples of basic gait descriptors (binary silhouette, GEI, GEnI)

One of the most successful multi-view approaches [36] also uses GEIs as basic features. It is a Bayesian approach considering the gait energy images as random matrices that are the sum of the gait identity and some independent noise corresponding to different gait variations (such as viewpoint, different clothes, or the presence of carried things), with both terms assumed to be normal random variables. Considering the joint distribution of GEI pairs under the assumptions that the corresponding classes coincide or differ reduces the problem to an optimization problem for the covariance matrices, which can be solved using the EM algorithm. Another approach considering correlation [37] is an adaptation of canonical correlation analysis to the gait recognition problem.

Another successful classical approach that uses GEIs as gait features is based on linear discriminant analysis (LDA) [38]. The idea of constructing an embedding such that the features become better clustered is proposed in several works [39—42]. In the MVDA approach [40], a multi-view discriminant analysis is carried out: for gait features computed for each viewing angle, a separate embedding is learned to minimize the intraclass variation and maximize the interclass one. A similar approach is proposed in ViDP (view-invariant discriminative projection) [42], aiming to decrease the difference between the features of the same identity regardless of the view and increase the gaps between different identities, and in [41], where a unified framework is proposed for learning a joint intensity metric for a pair of images and a spatial metric.
The sequential optimization of both metrics leads to a model that surpasses the basic discriminant analysis models in recognition quality. Let us finish the description of silhouette-based approaches by mentioning two more methods [43; 44]. In these approaches, the natural assumption is made that same-view recognition is easier than cross-view recognition; thus, view transformation models are proposed to embed the descriptors corresponding to different views into a common subspace.

The described approaches are intuitively clear and mathematically simple, which, in addition to high recognition results, gives them an advantage over many other more complex methods. A common drawback of most of these multi-view recognition methods is the need to calculate the gait descriptor for each viewing angle present in the database. Therefore, for each frame of the video one needs to know the shooting angle, which is not always possible with real data.

Human pose

Another important source of information, complementing silhouettes, is the human skeleton. Many recognition methods are based on the investigation of the human pose: the positions of the joints and the main parts of the body and their motion in the video while walking. Approaches based on posture range from fully structural ones (considering the kinematic characteristics of the pose) to more complex ones, combining kinematic and spatial-temporal characteristics.

The work [45], in which the recognition is performed both by gait and by appearance, can be included in the first group. The main characteristics of the gait used in this approach are absolute and relative distances between the joints, as well as features based on the displacement of key points of the figure between frames. Various kinematic parameters, such as the trajectories of the hip, ankles, and knees, and relative angles between different joints, are also proposed [46] for gait recognition, in addition to the gait period and walking speed. The authors of one more approach [47] introduce a more complex mathematical model, considering a family of smooth Poisson distance functions for constructing a skeleton variance image (SVI).

Several other works also propose models based on the positions of human body parts, but use them in conjunction with features of a different nature. For example, the method [48] combines a kinematic approach with a spatial one, considering both the trajectories of the joint movements and the change of silhouette shape over time as dynamic features of gait.

Body point trajectories

Another source of information that can be used for gait recognition is the set of point trajectories. Fisher motion descriptors are proposed [49] to be constructed from the trajectories and further classified by a support vector machine.

1.3.2 Automatic feature training

Despite the abundance of structural non-deep approaches, convolutional neural networks (CNN) hold a strong position in all tasks of computer vision, including gait recognition. Over the past few years, many neural network methods for gait identification have been proposed, differing both technically (in the choice of network architectures, loss functions, and training methods) and ideologically, in the method of data processing and extraction of the primary features supplied to the network input. The similarity of the gait and action recognition problems means that many approaches applied to the latter can be used for the former. The first and most influential deep method was proposed by Simonyan and Zisserman [50].
To recognize human actions, they trained a convolutional network with an architecture comprising two similar types of branches: an image stream and a flow stream. The former processes raw frames of the video to get the spatial representation, while the latter receives maps of optical flow (OF) computed from pairs of consecutive frames as input to extract the temporal information. To consider long-duration actions, several consecutive flow maps are stacked into blocks that are used as network inputs. Many action recognition algorithms based on this framework apply different top architectures (including recurrent architectures [51] and methods for fusing several streams [52]). Another method based on optical flow considers temporal information using three-dimensional convolutions [53].

Actually, optical flow is quite similar to dense trajectories [49]; the difference is that for trajectories we fix the initial "starting" point, while the optical flow is considered each time at a new location. Thus, one more approach [54] proposes to get temporal information not from the optical flow, but from the trajectories computed from the video. Combining these trajectories with spatial features extracted by a neural network gives enough information for gait recognition.

The second and most popular source of information used for neural network training is the binary masks of silhouettes, which have already been discussed when considering non-deep methods. In the simplest DeepGait method [55], the
In other works [61; 62], also using Siamese architectures with two or three streams trained by contrastive or triplet loss optimization, it is determined which gait images are close and belong to the same person, and which images belong to different people. The first approach able to outperform [60] being state-of-the-art for several years is GaitSet [63] considering the sets of silhouettes that are invariant to permutation and allow to mix the frames obtained under different views. The authors consider various aggregation functions and their combination (Set Pooling) to get set-level features from frame-level ones and also present Horizontal Pyramid Mapping which splits the neural feature maps into the strips of different scales in order to aggregate multi-scale information. 24 Besides, most of the architectures and approaches popular in recent years are successfully used for gait recognition. For example, autoencoders used for image generation and representation learning are also applicable for identification [64]. It is proposed to solve the problems of variability of views and carried items using a variety of autoencoders, each performing its own transformation similar to view transformation models [43]. Thus, passing through a sequence of “coding” layers, the image is transformed to a side view, which is easier for recognition. Similar approaches are proposed using generative adversarial networks (GAN) [65; 66] to convert the GEI to the “basic” view (lateral viewing angle, no bag and coat). Additionally to the classical forward propagation convolutional networks, recurrent neural networks can be used for gait recognition as well as for other video analysis tasks. Recurrent architectures allow to calculate informative dynamic gait features directly from the source data. For example, LSTM [67] layers are used to aggregate features obtained from raw frames and optical flow maps [51] or silhouette masks [68]. However, there are many approaches that apply recurrent layers not just to hidden neural representations, but to preliminary trained exact interpretable features. Thus, the authors of several methods [69—71] construct a pose estimation network whose outputs are then fed to LSTM layer. Such pose-based approaches are more general since they do not take the shape of human figure into account. Battistone and Petrosino [72] go further and unite the recurrent neural network with a graph approach exploiting both structured data and temporal information. Most of appearing neural network approaches are being applied to gait recognition which leads to active developement of this area. 1.3.3 Event-based methods Since the research of this thesis is evolving towards dynamic vision, the review of event-based methods has to be provided. Dynamic vision sensors [4—6] have become very popular recently. Due to their memory and temporal effectiveness, they are integrated in video surveillance systems, thus, most of computer vision problems have to be solved for the data obtained by these sensors. Recently, many solutions have been proposed for different related problems, however, gait recognition challenge had not been considered in termes of dynamic vision untill the study presented in this work. Nevertheless, the approaches for some computer vision problems are worth mentioning. 25 For example, many different models are proposed for object detection and tracking problems [73—75] that usually play an auxiliarry role for gait recognition. 
Another close problem is action recognition, which is also considered in terms of DVS data [76; 77]. In the latter work [77], three two-dimensional projections of the event stream are considered: the natural x-y projection consisting of the events aggregated over a time interval, together with the x-t and y-t projections, where events are accumulated over horizontal and vertical intervals. The features are extracted from these maps using Speeded Up Robust Features (SURF) [78] and then clustered and classified. One more related problem is gesture recognition [79; 80], which, like gait recognition, cannot be solved by appearance alone but requires motion as well. The first approach [79] is based on Hidden Markov Models, while the second one [80] uses deep learning methods.

The last problem worth mentioning while discussing event-based vision is optical flow estimation. Since both the OF and the event stream reflect motion, the methods of OF computation from DVS data are developed especially actively. The approaches vary from precise mathematical methods [81], where the differential equations relating the OF and the event stream are constructed and solved, to more "trainable" ones, such as the deep EV-FlowNet [82] approach based on an encoder-decoder neural architecture. Various methods for event data are constantly appearing, simplifying the processing of such data and spreading dynamic vision sensors over the video surveillance field.
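To make the event representation concrete, the x-y projection used in [77]-style approaches simply accumulates events over a time window into a 2D map. A minimal sketch (the structured-array field names are assumptions):

```python
import numpy as np

def xy_projection(events, width, height, t_start, t_end):
    """Accumulate DVS events from the window [t_start, t_end) into a 2D map.

    events: structured array with fields 'x', 'y', 't'
    (polarity is ignored in this simple count-based projection).
    """
    mask = (events['t'] >= t_start) & (events['t'] < t_end)
    frame = np.zeros((height, width), dtype=np.int32)
    np.add.at(frame, (events['y'][mask], events['x'][mask]), 1)
    return frame
```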
2. Baseline model

2.1 Motivation

Being a problem of biometric identification, gait recognition has a lot in common with other classical problems such as face, iris, or fingerprint recognition. In terms of machine learning, all of these problems can be stated in a similar way: given a sequence of frames with a walking person (or an image of a face, fingerprint, etc.), one needs to determine the Id of the presented person from the labelled database, i.e., classify the object. Mathematically, the only difference is the dimensionality of the input tensor due to the additional temporal dimension of the video.

One more similar computer vision problem is re-identification (re-id) [83]. This problem also generally concerns videos, but it is usually based on the comparison of separate frames and on determining whether the same person appears in different frames according to the appearance rather than the motion. However, the presence of a continuous frame stream changes the nature of the problem. Considering the gait recognition problem, one can see that the temporal component contains much information that can be used for better identification. This makes the investigated problem close to action recognition and other video classification tasks and allows the methods developed for these problems (e.g. [84]) to be applied to gait recognition.

Nevertheless, gait recognition is still not the same problem as action recognition. The set of possible actions is much wider than the set of different gaits. Despite the fact that each person has his or her own unique manner of walking, the variation of gaits is quite small. While two actions are usually easily distinguishable for the human eye, the gaits of two people are much closer and harder to differentiate. Most actions can be identified by "appearance" characteristics. For example, if a person holds a cellphone near the ear, he is probably talking to someone on the phone; if a person's figure is hanging in the air, he must have jumped. But the gait is always the same "walking" action, and if people have similar builds and clothing, or the resolution is low, the motion becomes more important than the people's appearance.

To separate the problem of gait recognition from re-identification and face recognition, one needs to analyse the movements of the points rather than the raw frames containing the appearance. Besides this, most characteristics of image appearance, such as color or contrast, do not influence the gait properties (see Fig. 2.1), which confirms that it is the motion that has to be extracted from the video.

Figure 2.1 — The same frame with different brightness, contrast, and saturation settings

For these reasons, the translations of the points are the data of greatest interest. The mathematical object containing the information about these translations is the optical flow (OF); thus, it is the optical flow that is chosen as the main source of information for gait recognition. Formally, the OF is the pattern of apparent motion of objects in a visual scene caused by the relative motion between an observer and the scene [85; 86]. It reflects the visible movement of the points of the scene between consecutive video frames (Fig. 2.2). In mathematical terms, the optical flow is a vector field corresponding to the translation of the points. Assuming that $I_1$ and $I_2$ are two consecutive frames of the video with width $W$ and height $H$, $OF$ is a 3-dimensional tensor of size $W \times H \times 2$ such that

$$I_2[x + OF[x, y, 0],\; y + OF[x, y, 1]] = I_1[x, y], \qquad (2.1)$$

i.e., the point with coordinates $(x, y)$ in frame $I_1$ shifts to $(x + OF[x, y, 0],\; y + OF[x, y, 1])$ in $I_2$.

Figure 2.2 — Optical flow as a vector field of motion between consecutive frames (first frame, second frame, optical flow)

If the camera is static and the background does not move or change, then only the moving objects generate flow, and a zero vector field corresponds to the background. The usage of optical flow was proposed for action recognition in [50], which was one of the most successful video classification approaches of its time and encouraged many researchers to analyse the optical flow in addition to the raw frames. Since gait recognition is much more motion-dependent than general action recognition, the optical flow is supposed to contain all the information needed for motion analysis and for recognizing people by their walking manner. Besides, using the OF maps as the source data, one can be sure that the recognition is based not on the person's appearance, clothes, body type, or shape, but on the motion patterns.

2.2 Proposed algorithm

Figure 2.3 — General algorithm pipeline

In this section, the pipeline of the proposed baseline algorithm is described, and the details of all the steps are explained. Figure 2.3 shows the common scheme of the recognition process. The whole algorithm consists of descriptor calculation and descriptor classification, corresponding to the orange and green blocks of the scheme, respectively. The feature calculation step, in turn, includes the preparation and preprocessing of the data, neural descriptor extraction, and feature postprocessing and aggregation into one final gait descriptor. The following subsections focus on each of the four main steps of the algorithm:
1. Data preprocessing;
2. Neural network backbone;
3. Feature postprocessing;
4. Classification.

2.2.1 Data preprocessing

As explained in section 2.1, the optical flow maps are chosen as the source data for the algorithm. Unlike the two further steps of feature calculation, the preprocessing stage of the proposed algorithm is non-trainable: all the parameters and hyper-parameters are chosen and fixed beforehand, and no optimization is performed. Thus, the preprocessing can be done once, and the data can be stored in processed form in order not to repeat the same calculations at each training and testing iteration. For this reason, the first preparation step is the estimation of the optical flow between each pair of consecutive frames and saving it in a convenient format. The Farneback algorithm [87] was chosen for OF estimation due to its stability and the presence of open-source implementations in different computer vision libraries (e.g., the OpenCV library [88]). The algorithm consists of two steps: quadratic approximation of each pixel neighbourhood by a polynomial expansion transform, and point displacement estimation based on the transformation of the polynomials under translations. The output of the estimator is a tensor $OF$ as described by (2.1).
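A minimal sketch of this step with OpenCV; the parameter values are typical ones from the OpenCV documentation, not settings specified in this work:

```python
import cv2

def flow_maps(gray_frames):
    """Dense Farneback optical flow between consecutive grayscale frames."""
    maps = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        # Each map is an H x W x 2 field of horizontal and vertical shifts.
        # Positional args: flow, pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        maps.append(flow)
    return maps
```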
Inspired by [51], we store these tensors as images. To that end, we apply a linear transformation to the maps so that all the OF values lie in the interval [0, 255], similarly to common RGB images. Besides this, since the tensor has 2 channels (the horizontal and vertical components of the flow), we add an auxiliary zero map to get a three-channel tensor and to be able to save it similarly to the raw frames. The two separate components of the flow and the constructed three-channel image are presented in Figure 2.4.

Figure 2.4 — Visualizations of the OF components and the constructed image (horizontal component, vertical component, 3-channel image)

Although optical flow maps contain information about motion by themselves, in order to enhance the temporal component and to use not only instantaneous but continuous motion, the maps are used not separately but jointly. In more detail, several consecutive maps are stacked into one block, and such blocks are used as the network input. So, instead of two-channel maps, the network is fed with tensors of $2L$ channels, where $L$ is the number of frames in a block.

Since the frame usually contains not only the human figure, and there can be other visible moving objects and background, the bounding box containing exactly the considered figure has to be cropped from the maps to get rid of noise and errors. To get this box, any human figure detection algorithm can be applied. However, in recent years many investigations in the human body analysis field have been made, and human figure detection often turns out to be an auxiliary problem that is solved together with other, more complex challenges. For example, the human figure bounding box can be calculated during pose estimation. The OpenPose algorithm [89] estimates the locations of the main joints of the body for all the people present in a frame, and having these locations found, one can easily calculate their bounding box, which is the bounding box for the whole figure. This method has been chosen in the baseline method. In chapter 3 it will be shown how to use all the information obtained from the OpenPose algorithm for recognition improvement.

To prevent relative shifts of the maps, one common bounding box is calculated for the whole block. Specifically, let $\{I_t\}_{t=1}^{T}$ be the video sequence with a moving person and $\{B_t\}_{t=1}^{T}$ be the bounding boxes containing the figure in every frame. It is convenient to encode each box by the coordinates of its upper left and lower right vertices, $B_t = (x_t^l, y_t^u, x_t^r, y_t^l)$. Having the bounding box for each frame, the outer box that contains all the boxes of the block is constructed. For a block that consists of the OF maps for frames $I_k, I_{k+1}, \ldots, I_n$, the outer box coordinates are

$$x^l = \min_{t=k,\ldots,n} x_t^l, \quad y^u = \min_{t=k,\ldots,n} y_t^u, \quad x^r = \max_{t=k,\ldots,n} x_t^r, \quad y^l = \max_{t=k,\ldots,n} y_t^l.$$

This box is then made square by widening or lengthening (depending on the aspect ratio) while maintaining the center of the box.
The OF images are stored together with all the bounding boxes, and prior to inputting the data into the network, a certain number $L$ of consecutive horizontal and vertical components of the OF maps are concatenated into one block, and the common square patch is cropped from the block. Various values of $L$ and arrangements of the blocks were considered, but the best results were obtained with $L = 10$ and a 5-frame overlap of the blocks. So, the $k$-th block consists of the maps $\{OF_t\}_{t=5k-4}^{5k+5}$, $k \geqslant 1$. Figure 2.5 shows the cropped block obtained by such a procedure frame by frame.

Figure 2.5 — Input block. The square box cropped from 10 consecutive frames of the horizontal (the first and second rows) and vertical (the third and fourth rows) components of the normalized OF maps

2.2.2 Neural network backbone

The main part of the algorithm is based on neural networks. Being a powerful tool for many computer vision problems, the deep learning technique is used in the proposed algorithm for embedding training and further feature extraction. All the considered neural networks are trained for the classification task by the classical backpropagation method. However, to get one general model that does not require re-training when a new subject is added to the database, we do not use a neural network as a classifier itself, but cut off its head and compute the hidden representations of the objects. Such an approach makes the feature extraction process easier and reduces the model updating to re-fitting or fine-tuning a simple classification model, which takes much less time and computation. The following part of this section describes the steps that are performed in order to get the neural network gait descriptors:
1. data augmentation;
2. neural network architecture selection;
3. training procedure;
4. feature extraction.

Data augmentation

As noted in chapter 1, one of the greatest problems preventing a general solution of the gait recognition challenge is the lack of data. Gait dataset collection is a complex process, and currently even the largest collections are relatively small. This factor makes the application of deep learning methods harder, since even not very deep architectures usually contain a large number of parameters to optimize, and hence a large amount of data is required for training. Thus, data augmentation is needed to increase the number of different training objects.

During the training process, we first decrease the resolution of the frames to 88 × 66 to keep the network input size moderate, and then construct the blocks by randomly cropping a square containing the block bounding box of the figure. If both sides of the box (after the proportional box transformation corresponding to the resolution decrease) are less than 60 pixels, we uniformly choose the left and upper sides ($x$ and $y$, respectively) of the square so that $x$ lies between the left border of the frame and the left side of the box and $x + 60$ lies between the right side of the box and the right border of the frame. The upper bound $y$ of the square is chosen similarly. If any of the box sides is greater than 60 pixels (which happens rarely, since 60 is almost the whole height of the frame), we crop the square with side equal to the maximum of the height and width of the box and then resize it to 60 × 60. Thus, we get an input tensor of shape 20 × 60 × 60, where 20 is the number of channels corresponding to 10 pairs of horizontal and vertical components of the optical flow, and 60 is the spatial size of each map. In addition, we flip all frames of the block simultaneously with probability 50% to get the mirror sequence. Since we use the optical flow instead of raw images, the reflection is made as follows:

$$\text{horizontal component:}\quad OF[x, y, 0] := -OF[w - x, y, 0],$$
$$\text{vertical component:}\quad OF[x, y, 1] := OF[w - x, y, 1],$$

where $w$ is the width of the frame, equal to 66 after the resolution decrease.
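A minimal sketch of this flip for a stacked block (the channel layout, with horizontal components in even channels and vertical ones in odd channels, is an assumption):

```python
import numpy as np

def flip_flow_block(block):
    """Mirror a block of optical flow maps horizontally.

    block: array of shape (2L, H, W); even channels are assumed to hold
    horizontal flow components and odd channels vertical ones.
    """
    flipped = block[:, :, ::-1].copy()   # reverse the x axis
    flipped[0::2] *= -1                  # horizontal motion changes sign
    return flipped
```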
2.2.2 Neural network backbone

The main part of the algorithm is based on neural networks. Being a powerful tool for many computer vision problems, the deep learning technique is used in the proposed algorithm for embedding training and subsequent feature extraction.

All the considered neural networks are trained for a classification task by the classical backpropagation method. However, to get one general model that does not require re-training when a new subject is added to the database, we do not use a neural network as a classifier itself, but cut off its head and compute the hidden representations of the objects. Such an approach makes the feature extraction process easier and reduces model updating to re-fitting or fine-tuning a simple classification model, which takes much less time and computation. The following part of this section describes the steps performed to obtain neural network gait descriptors:
1. data augmentation;
2. neural network architecture selection;
3. training procedure;
4. feature extraction.

Data augmentation

As noted in section 1, one of the greatest obstacles to a general solution of the gait recognition challenge is the lack of data. Gait dataset collection is a complex process, and currently even the largest collections are relatively small. This factor complicates the application of deep learning methods, since even not very deep architectures usually contain a large number of parameters to optimize and, hence, require a large amount of training data. Thus, data augmentation is needed to increase the number of different training objects.

During the training process, we first decrease the resolution of the frames to 88 × 66 to keep the network input size moderate, and then construct the blocks by randomly cropping a square containing the block bounding box of the figure. If both sides of the box (after the proportional box transformation corresponding to the resolution decrease) are less than 60 pixels, we uniformly sample the left and upper bounds ($x$ and $y$, respectively) of the square, so that $x$ lies between the left border of the frame and the left side of the box, and $x + 60$ lies between the right side of the box and the right border of the frame. The upper bound $y$ of the square is chosen similarly. If either side of the box is greater than 60 pixels (which happens rarely, since 60 is almost the whole height of the frame), we crop the square with the side equal to the maximum of the height and width of the box and then resize it to 60 × 60. Thus, we get an input tensor of shape 20 × 60 × 60, where 20 is the number of channels corresponding to 10 pairs of horizontal and vertical optical flow components, and 60 is the spatial size of each map.

In addition, we flip all the frames of a block simultaneously with probability 50% to get the mirror sequence. Since we use optical flow instead of raw images, the reflection is made as follows:

horizontal component: $OF[x, y, 0] := -OF[w - x, y, 0]$,
vertical component: $OF[x, y, 1] := OF[w - x, y, 1]$,

where $w$ is the width of the frame, equal to 66 after the resolution decrease. These two transformations help to increase the amount of training data and to reduce the overfitting of the model. Besides, before inputting the blocks into the net, we subtract the mean values of the horizontal and vertical optical flow components computed over the whole training dataset.
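The mirror reflection above can be sketched as follows; the block is assumed to hold raw signed flow values, with the horizontal components in the even channels.

```python
import numpy as np

def flip_block(block):
    # block: (H, W, 2L) tensor of stacked OF maps.
    flipped = block[:, ::-1, :].copy()   # reflect along the horizontal axis
    flipped[..., 0::2] *= -1             # horizontal components change sign
    return flipped                       # vertical components are kept as is
```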
Neural network architectures and training process

We consider and compare several convolutional neural network architectures.

CNN-M architecture

We started from the architecture used in [90], which is based on the CNN-M architecture from [91], and used a model based on this architecture as a baseline. The only modification we made is using a Batch Normalization (BatchNorm) layer [92] instead of the Local Response Normalization (LRN) [93] used originally. According to numerous investigations, batch normalization allows each layer to be trained somewhat more independently, which is very useful when the hidden representations are extracted from the network. Besides, this normalization prevents unlimited growth of the activations, allowing a higher learning rate. Finally, it acts as a regularizer and reduces the overfitting of the model, which is particularly valuable given the relatively small amount of training data. Thus, BatchNorm showed a considerable acceleration and improvement in trainability and was selected as the main normalization method for network training.

The architecture is described in Table 2.1. The first four columns correspond to convolutional layers. They are specified by the kernel size (the first row), the number of output channels (the second row), the presence of additional transformations following the convolution, such as normalization, stride, or pooling (the third and fourth rows), and the activation function (the last row).

Table 2.1 — The baseline architecture

Layer      | C1        | C2       | C3       | C4   | F5      | F6      | SM
Kernel     | 7×7       | 5×5      | 3×3      | 2×2  | —       | —       | —
Channels   | 96        | 192      | 512      | 4096 | 4096    | 2048    | subjects number
           | BatchNorm | Stride 2 | Stride 2 | —    | Dropout | Dropout | —
           | Pool 2    | Pool 2   | Pool 2   |      |         |         |
Activation | ReLU      | ReLU     | ReLU     | ReLU | ReLU    | ReLU    | SoftMax

The first convolutional layer has 96 filters of size 7 × 7, followed by BatchNorm and a 2 × 2 MaxPooling layer. The second convolution consists of 192 filters of size 5 × 5 with 2 × 2 stride and pooling. The third layer has 512 kernels of size 3 × 3 with the same stride and pooling parameters. The convolutional part of the network ends with the C4 layer, which consists of 4096 filters of size 2 × 2 without further dimensionality reduction. After this block, three fully-connected layers are applied. The fifth and sixth layers consist of 4096 and 2048 units, respectively, and both are followed by dropout [94] layers with parameter 𝑝 = 0,5. The first six layers are followed by rectified linear unit (ReLU) activations. The last one is a prediction layer with the number of units equal to the number of subjects in the training set and a softmax non-linearity.

The model with this architecture and a cross-entropy loss function was trained by the Nesterov Momentum gradient descent method, with a starting learning rate of 0,1 reduced by a factor of 10 each time the error plateaued. Despite the moderate size of the whole net, it was trained in four steps. Starting with the fourth, fifth, and sixth layers of size 512, 512, and 256, respectively, the sizes of these layers were doubled each time the loss function stopped decreasing. Two ways of widening the layers were considered: adding randomly initialized new parameters, and the net2net method [95] developed for rapid transfer of information from one network to another. The latter approach indeed loses no information, and the validation quality of the net before and after widening is the same. However, it turns out that preserving the information obtained in previous training steps in this way increases overfitting and worsens the testing quality, whereas adding randomly initialized new weights introduces a random component that acts as a regularizer and improves the classification accuracy. Glorot [96] weight initialization with a uniform distribution was chosen for sampling the new weights and confirmed the positive influence of adding noise inside the model during training.

More precisely, let $L_i$ be a dense layer to be widened, let $n$ be its initial size, and $2n$ the required size. Then three network parameters have to be modified: the weight matrix $W_{(i-1)\to(i)}$ and the bias $b_{(i-1)\to(i)}$ between layers $L_{i-1}$ and $L_i$, and the weight matrix $W_{(i)\to(i+1)}$ between $L_i$ and $L_{i+1}$. Each of these parameters doubles one of its dimensions; thus, a random matrix (or vector) of the same shape is sampled from the Glorot uniform distribution and concatenated with the pretrained one. Because these layers are fully connected, appending new dimensions with random values adds some noise to the layer outputs, but since these values are distributed around zero, the noise turns out to be small enough and regularizes the training.
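A sketch of one widening step is shown below. The treatment of the new bias entries (initialized here with zeros) is an assumption, since the text specifies Glorot uniform sampling only for the weight matrices.

```python
import numpy as np

def glorot_uniform(shape):
    limit = np.sqrt(6.0 / (shape[0] + shape[1]))
    return np.random.uniform(-limit, limit, size=shape)

def widen_layer(W_in, b_in, W_out):
    # Double the size of dense layer L_i from n to 2n units by appending
    # freshly initialized parameters to the pretrained ones.
    W_in2 = np.concatenate([W_in, glorot_uniform(W_in.shape)], axis=1)
    b_in2 = np.concatenate([b_in, np.zeros_like(b_in)])   # bias: assumption
    W_out2 = np.concatenate([W_out, glorot_uniform(W_out.shape)], axis=0)
    return W_in2, b_in2, W_out2
```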
CNN-M fusion

The first modification of the CNN architecture is a slow fusion of several branches of the same network (similar to the idea from [84]). Instead of one big 10-frame block taken as input, 4 consecutive blocks of smaller size are used. In more detail, 4 blocks of OF maps are taken, each consisting of 5 pairs of maps, and passed through separate branches of convolutional layers with shared weights. After the fourth convolutional layer, the outputs of the first two branches and of the last two are concatenated, so that two streams remain instead of four. These two outputs are passed through the fully-connected fifth layer, after which they are concatenated into one stream that continues through the remaining layers of the network. Figure 2.6 presents the scheme of the proposed architecture. This approach allows a longer period of walking to be taken into account without losing instantaneous features. This net was trained stage-by-stage as well.

Figure 2.6 — The scheme of the CNN-M fusion architecture

VGG-like architecture

The third architecture considered in this research is similar to the architectures of the VGG family [97]. The idea of using blocks of convolutions with small kernels has proven very successful in the pattern recognition area. Nevertheless, although each filter in a VGG architecture is only 3 × 3, the depth of the network makes the number of parameters extremely large relative to the available dataset sizes. To prevent overfitting, a truncated VGG-16 architecture was taken; namely, the last convolutional block was removed from the network. The details are given in Table 2.2.

Table 2.2 — The VGG-like architecture

Block      | B1           | B2            | B3            | B4            | F5      | F6      | SM
Layers     | 3×3, 64 (×2) | 3×3, 128 (×2) | 3×3, 256 (×3) | 3×3, 512 (×3) | 4096    | 4096    | subjects number
           | Pool 2       | Pool 2        | Pool 2        | Pool 2        | Dropout | Dropout | —
Activation | ReLU         | ReLU          | ReLU          | ReLU          | ReLU    | ReLU    | SoftMax

The first four columns correspond to blocks of convolutional layers. All the convolutions have 3 × 3 filters: the first block consists of two layers with 64 filters each, the second of two layers with 128 filters each, and the third and fourth blocks contain three layers with 256 and 512 filters, respectively. All the convolutional blocks are followed by MaxPooling layers. After the convolutional blocks there are three dense layers: two of them consist of 4096 units and are followed by dropout with parameter 𝑝 = 0,5, and the last one is the softmax layer. All layers except the last one are followed by the ReLU non-linearity. This net was also trained step by step, starting with the two fully-connected layers of size 1024 and doubling them whenever the validation error stopped decreasing.

In all the architectures, an $L_2$ regularization of the F5 and F6 layers was added to the cross-entropy loss function in order to avoid excessively large weights in the dense layers, which lead to overfitting. The regularization parameter was increased step by step from 0,02 to 0,05.

Feature extraction

Once the net is trained, it is used as a feature extractor. The SM dense layer is removed from the network, and the outputs of the last hidden layer are used as gait representations. The steps described above yield a feature vector for each block of optical flow maps. During the testing phase of the algorithms, these blocks are constructed deterministically: no flips are made, and the patches are cropped by choosing the mean values of the admissible ranges of the left and upper bounds.
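In a modern framework, removing the head reduces to reading the activations before the SM layer. Below is a PyTorch-style sketch under the assumption that the trained classifier exposes its hidden part as `model.features` (a hypothetical attribute name; the dissertation's own implementation used a different framework).

```python
import torch

@torch.no_grad()
def extract_descriptors(model, blocks):
    # blocks: (n, 2L, 60, 60) tensor of deterministic OF blocks of one video.
    model.eval()                     # freeze BatchNorm/Dropout behaviour
    hidden = model.features(blocks)  # outputs of the last hidden layer
    return hidden                    # one gait descriptor per block
```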
2.2.3 Feature postprocessing and classification

Having a feature vector for each OF block, one needs to obtain a final response for this set of features, i.e. to process the obtained descriptors, aggregate them into one gait descriptor, and classify it. This is what this section focuses on. It is worth noting that in the most trivial case, when the subjects in the training set coincide with those in the testing set, the network itself can be used as an already fitted classifier. This is equivalent to logistic regression on the probability distribution over subjects and, hence, to linear classification of the neural descriptors. Nevertheless, if the training and testing subjects differ, the classifier requires fitting, so the feature learning and the classification are proposed to be performed separately.

The Nearest Neighbour (NN) classifier has been chosen as the gait feature classification method. Being one of the simplest methods in machine learning, it amounts to searching the database for the most similar object (according to some similarity measure). Despite its simplicity, this approach shows high classification accuracy and often outperforms other popular classification methods such as logistic regression or the support vector machine (SVM). Being a metric algorithm, nearest neighbour classification is much more successful if the features are scaled beforehand; thus, an $L_2$ normalization of each block descriptor is performed right after its extraction.

However, a direct search for the closest descriptor of each block does not result in high recognition quality. Since each block contains quite a short video subsequence, it does not capture the whole motion, and the features learned from this data may depend on the location of the block within the gait cycle. As a result, the closest descriptor may correspond to a block at the same cycle position belonging to another subject. To get rid of this position dependence, we first average the features over the whole video into one gait descriptor and then search for the most similar one among the other video descriptors. In addition to position independence, such aggregation accelerates the subsequent classifier fitting and application, since the number of training and testing descriptors is greatly reduced: only one descriptor remains per video, regardless of its length. Compared to a voting scheme, where the results are aggregated after the classifier is applied, this approach shows a considerable gain in speed as well as in quality. After the average gait descriptor is computed for each video, the closest one in the database is found and returned as the response. In addition to the "classical" Euclidean distance between vectors, the Manhattan distance is considered as the basis of the NN classifier. This metric turns out to be more stable and appropriate for measuring the similarity of gait descriptors and yields better results in the majority of experiments.

The described process provides a fast and accurate baseline classification of neural descriptors. However, two auxiliary modifications of the descriptors can further improve recognition. First, principal component analysis (PCA) can be applied to reduce the dimensionality of the feature vectors and remove some of the noise they contain. Besides, such a low-dimensional projection accelerates the fitting of any classifier. Being a linear transformation, PCA could in principle be merged with the last dense layer of the network, but the non-linear descriptor normalization is applied between them, and the experiments show that this solution increases the model accuracy.

Figure 2.7 — The pipeline of the algorithm based on blocks of optical flow maps
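The postprocessing and classification described above amount to a few lines. A sketch follows, assuming the block descriptors of each video are stacked into a NumPy array:

```python
import numpy as np

def video_descriptor(block_feats):
    # L2-normalize each block descriptor, then average over the whole video.
    norms = np.linalg.norm(block_feats, axis=1, keepdims=True)
    return (block_feats / norms).mean(axis=0)

def nearest_neighbour(probe, gallery, ord=1):
    # gallery: {subject_id: averaged gait descriptor}; ord=1 gives the
    # Manhattan distance, which proved more stable than the Euclidean one.
    return min(gallery,
               key=lambda s: np.linalg.norm(probe - gallery[s], ord=ord))
```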
Further recognition improvement can be achieved with the Triplet Probability Embedding (TPE) [98]. This method aims to find a low-dimensional discriminative embedding (a projection matrix) that makes the features of the same object closer to each other than the features of different objects. It is trained to find an embedding such that

$$S_W(v_a, v_p) > S_W(v_a, v_n), \quad (2.2)$$

where $S = S_W$ is a similarity, usually defined as the cosine measure, $W$ is a parametrization of the embedding, the features $v_a, v_p$ belong to the same object, and $v_n$ belongs to another one. The inequality (2.2) is quite abstract, and the following optimization problem is formulated to achieve it. Let us consider the probability of the triplet $(v_a, v_p, v_n)$:

$$p_{apn} = \frac{e^{S_W(v_a, v_p)}}{e^{S_W(v_a, v_p)} + e^{S_W(v_a, v_n)}} \quad (2.3)$$

The closer $v_a$ and $v_p$ are, relative to the similarity of $v_a$ and $v_n$, the higher this probability $p_{apn}$ is. Thus, in order to obtain the optimal parameters $W$, one can follow the maximum likelihood method and maximize this probability or its logarithm, which is usually easier to optimize:

$$Loss = -\sum_{(v_a, v_p, v_n)} \log(p_{apn}) \to \min_W \quad (2.4)$$

Hence, the problem of finding $S_W$ reduces to finding $W$ by any optimization method. For the gait descriptors, this approach has been modified by using the Euclidean distance instead of the scalar product. Defining the similarity as $S_W(u, v) = -\|Wu - Wv\|_2^2$, the likelihood maximization problem turns into

$$\sum_{(v_a, v_p, v_n)} \log \frac{e^{-\|Wv_a - Wv_p\|^2}}{e^{-\|Wv_a - Wv_p\|^2} + e^{-\|Wv_a - Wv_n\|^2}} \to \max_W \quad (2.5)$$

which can be solved by the stochastic gradient descent (SGD) method. In addition to the improved performance, this embedding retains the advantages of dimensionality reduction, such as memory and time efficiency. This embedding shows the best performance and completes the description of the proposed algorithm. The whole model pipeline is shown schematically in Figure 2.7.
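A sketch of the loss (2.4)–(2.5) with the Euclidean similarity, written for stochastic optimization of the projection matrix $W$ (PyTorch is used here purely for automatic differentiation, not because the original implementation did):

```python
import torch

def tpe_loss(W, v_a, v_p, v_n):
    # W: (k, d) projection; v_a, v_p, v_n: (batch, d) anchor/positive/negative.
    ea, ep, en = v_a @ W.T, v_p @ W.T, v_n @ W.T
    s_ap = -((ea - ep) ** 2).sum(dim=1)  # S_W(v_a, v_p) = -||Wv_a - Wv_p||^2
    s_an = -((ea - en) ** 2).sum(dim=1)
    # -log p_apn from (2.3) equals logsumexp(s_ap, s_an) - s_ap
    return (torch.logsumexp(torch.stack([s_ap, s_an]), dim=0) - s_ap).mean()
```

Minimizing this value by SGD over triplets of training descriptors is equivalent to the maximization in (2.5).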
2.3 Experimental evaluation

In this section, the details of the experiments and their numerical evaluation are described.

2.3.1 Evaluation metrics

Having a nearest neighbour classifier on top, the proposed approach outputs the ID of the most probable subject the considered video belongs to. Thus, the simplest and most convenient metric to evaluate the quality of the algorithm is accuracy, or Rank-1, reflecting the ratio of correctly classified objects. However, for a more thorough comparison of all the models, the Rank-k metric is considered for 𝑘 = 5. This measure equals the ratio of samples for which the correct label belongs to the set of 𝑘 most probable answers. To evaluate Rank-k, one needs to predict not only the most probable class, but the probability distribution over all the classes, or at least the 𝑘 most probable classes. To find these 𝑘 most probable classes, the classes are sorted by their distance to the test object, where the distance between a class and the test object is defined by the class representative closest to the test object. Thus, all the gallery samples are traversed in order of increasing distance, skipping samples of classes that have already appeared. The five "closest" classes are taken as the most probable ones and used for the Rank-5 calculation.

2.3.2 Experiments and results

All the methods described above were evaluated on two colored gait datasets: TUM-GAID [7] and the CASIA Gait Dataset B [9]. For the first one, the classification train-test split provided by the authors of the dataset was used: 150 training subjects and 155 testing ones. As for the CASIA dataset, there is no single training protocol, and different researchers split the collection differently. Since the baseline model was proposed for side-view recognition, only the videos captured under the 90° angle were used in the experiments. Similarly to TUM, the dataset was split approximately in half: 60 subjects were used for training the network, and the remaining 64 subjects formed the testing set. While the training parts of the datasets were used for the neural network and TPE training, the evaluation was made on the testing subsets. Having 10 videos for each person, we fit the nearest neighbour classifier on the first 4 "normal" walks and evaluate the model quality on the remaining videos.

All the experiments can be divided into three parts:
1. separate training and evaluation on each of the datasets;
2. training on one dataset and evaluation on both of them;
3. joint training on the two datasets and evaluation on both of them.

These experiments aim to estimate the influence of different architectures and embeddings on the classification quality and to compare with existing methods. Besides, the generalizability of these methods is investigated. Ideally, the algorithm should not depend on the background conditions, the illumination, the proximity to the camera, or anything other than the walking style of a person. So, it is worth evaluating how an algorithm trained on one dataset works on the other one.

The first set of experiments was conducted to evaluate the efficiency of the proposed approach in general. A network was trained from scratch with the top layer of size equal to the number of training subjects. Then the low-dimensional embedding was trained, and the search index was constructed on the fitting data. The results of the experiments are shown in Table 2.3.

Table 2.3 — Results on TUM-GAID dataset

Architecture     | Embedding (dimensionality) | Metric | Rank-1 [%] | Rank-5 [%]
CNN-M            | PCA (1100)                 | L1     | 93,22      | 98,06
CNN-M            | PCA (1100)                 | L2     | 92,79      | 98,06
CNN-M            | PCA (600)                  | L1     | 93,54      | 98,38
CNN-M            | TPE (450)                  | L2     | 94,51      | 98,70
CNN-M fusion     | PCA (160)                  | L1     | 93,97      | 98,06
CNN-M fusion     | PCA (160)                  | L2     | 94,40      | 98,06
CNN-M fusion     | TPE (160)                  | L1     | 94,07      | 98,27
CNN-M fusion     | TPE (160)                  | L2     | 95,04      | 98,06
VGG              | PCA (1024)                 | L1     | 97,20      | 99,78
VGG              | PCA (1024)                 | L2     | 96,34      | 99,67
VGG              | TPE (800)                  | L1     | 97,52      | 99,89
VGG              | TPE (800)                  | L2     | 96,55      | 99,78
CNN-M + SVM [90] | —                          | L2     | 98,00      | 99,60

The experiments show that each subsequent architecture is more successful in gait recognition than the previous one. The small filter kernels of the VGG model manage to learn more informative features than the larger filters of the CNN-M model. Besides, training a low-dimensional TPE always increases the accuracy compared to the common nearest neighbour classifier with the $L_2$ metric. Nevertheless, the $L_1$ metric often turns out to be more successful: although the TPE was trained by optimizing a Euclidean similarity, the best result is achieved by the NN classifier based on the $L_1$ distance between the embeddings of the descriptors. This confirms that the Manhattan distance is more appropriate for the analysis of neural gait descriptors. Although the accuracy of the proposed models does not exceed the Rank-1 of Castro et al. [90], the correct label is more often among the most probable ones.

In order to check how sensitive the proposed method is to small appearance changes, the recognition quality for different clothing and carrying conditions was evaluated separately. The results of this separation are presented in Table 2.4 and show that the recognition in "normal" conditions is almost perfect, while a change of shoes or the presence of a backpack indeed makes identification more difficult.

Table 2.4 — Quality of recognition on TUM-GAID dataset under different conditions

                   | Normal            | Backpack          | Shoes
Method             | Rank-1 | Rank-5   | Rank-1 | Rank-5   | Rank-1 | Rank-5
VGG, L1            | 99,7%  | 100,0%   | 96,5%  | 99,7%    | 96,5%  | 100,0%
CNN + SVM [90], L2 | 99,7%  | 100,0%   | 97,1%  | 99,4%    | 97,1%  | 99,4%

The VGG-like architecture has been chosen as the main one for the further experiments, as it gave the best results on the TUM database.
Side-view CASIA recognition based on the same VGG architecture was evaluated as well; however, the accuracy on this dataset turned out to be much lower: 71,80% of correct answers using the $L_2$ distance and 74,93% with the $L_1$ metric. This confirms that the CASIA database is a challenging benchmark combining a small number of subjects with large condition variability.

After the networks were trained on each of the datasets, the second set of experiments was conducted to investigate the generality of the algorithm, i.e. transfer learning experiments. TUM-GAID videos and the side view of CASIA look quite similar (see Fig. 2.8), so it was checked whether one algorithm can be used for both datasets.

Figure 2.8 — Side-view frames from TUM and CASIA gait databases

Despite the good results of the algorithms on the TUM-GAID dataset, it turns out that the accuracy of the same algorithm with the same parameters on the CASIA dataset is very low. The same happens in the opposite case, when the network trained on CASIA is used for extracting features from TUM videos. These experiments were made using the truncated VGG, the best of the considered architectures, and the NN classifier based on the $L_1$ metric without any extra embedding. Table 2.5 shows the accuracy of such transfer classification.

Table 2.5 — The quality of transfer learning

             | Testing Set
Training Set | CASIA   | TUM
CASIA        | 74,93%  | 67,41%
TUM          | 58,20%  | 97,20%

One can see that all the results on CASIA are still lower than on TUM-GAID, and that transferring the network weights noticeably worsens the initial quality.

The third part of the experiments aimed to investigate whether a stable model can be constructed by combining the datasets rather than using only one of them. For this purpose, the neural network was trained on the union of the available data and evaluated on each of the testing subsets. Since all the subjects in the two databases are distinct, the databases can be united into one dataset, and the network can be trained once on this union. The obtained model is supposed to learn the gait features of both collections and thus be more general than the previous models. Because the two parts of the united dataset have different sizes, the input batches had to be balanced during network training. Each training batch was composed of the same number of objects from both databases; thus, each OF block from the "small" CASIA dataset was sampled more frequently than the blocks from TUM. The evaluation of the algorithm was made individually for each dataset, and the results are shown in Table 2.6. Despite the joint learning procedure and class balancing, the quality of such a united model turns out to be lower than the quality of the separate algorithms for each dataset. This reflects the overfitting of the latter models, which was not prevented by dropout and regularization.

Table 2.6 — The quality of the jointly learned model

                                 | TUM                | CASIA
Model                            | Rank-1  | Rank-5   | Rank-1  | Rank-5
VGG + NN, L1, balanced learning  | 96,45%  | 99,24%   | 72,06%  | 84,07%
VGG + NN, L2, balanced learning  | 96,02%  | 99,35%   | 70,75%  | 83,02%

2.4 Conclusions

The experiments show that optical flow contains enough information for gait recognition. Even if the shape of the human figure changes (for example, when a bag or backpack appears), the nature of the motion is preserved and the person can be recognized. However, the available datasets are quite small and insufficient for training a general, stable algorithm.
The algorithms based on deep neural networks overfit to the dataset conditions: they work well as long as all the settings are fixed, while a change of the recording conditions worsens the recognition quality. Nevertheless, even being imperfect, the transferred model still copes with unknown data and makes a relatively accurate classification.

Starting from this baseline model, the approach is developed both ideologically, by investigating the motion in more detail, and constructively, by considering new neural network architectures and training methods.

3. Pose-based gait recognition

This chapter focuses on the development of the baseline method. Continuing to use optical flow as the main source of information, it is suggested that the algorithm should consider the motion in more detail in order to capture more important gait characteristics.

3.1 Motivation

In this research, the problem of human identification in video is reduced to gait recognition. According to the medical encyclopedia [99], gait combines the biomotorics of the free limbs with the movements of the trunk and head, in which muscular coordination is regulated by the mechanisms of movement, posture maintenance, and body balance. However, the formal medical definition aside, gait is the manner of walking, and some parts of the human body participate in the walking process more than others. Thus, the natural assumption is made that different body parts affect the gait representation differently, and some of them are more informative than others. For example, the hands and head seem not very informative: while walking, a person can move his or her hands independently (waving to somebody, carrying a bag, speaking on the phone, or just keeping the hands in pockets) or turn the head to look around (see Fig. 3.1 for visual examples). The features obtained from the motion of the hands and head are therefore expected to be quite noisy and not useful for recognition. Since the legs are the limbs most involved in walking, the bottom part of the body is expected to carry more gait information and therefore to deserve more attention than the head or hands.

Figure 3.1 — Gait-independent motions of the hands and head that are not informative (panels a–f)

Consequently, it is proposed to estimate the human pose, define some key points to consider in more detail, and limit the analysis of the optical flow maps to the neighbourhoods of these key positions. In doing so, we do not stop considering the complete motion of the full body: a hierarchy of multi-scale areas of the human figure, from the full body down to exact joints, is considered, and the motion of the points in these shrinking areas is observed.

Looking ahead, let us note that sometimes full colored image sequences are not available and only silhouettes are provided (as, for example, in the OULP dataset [8] described in section 1.2). This can happen due to very bad lighting and low quality of the recordings, which makes the transitions between body parts and pieces of clothing invisible, or for legal reasons (personal data protection). In this case, the pose can hardly be estimated, but the optical flow can still be computed even for such frames. However, applied to silhouette masks, the OF reflects the movement of the external edges of the human figure, which is visualized in Figure 3.2. The edges of the figure can also be considered as "parts of the body", so this case is investigated as well.
Figure 3.2 — The visualization of optical flow between two silhouette masks (two channels are the components of OF and the third one is set to zero)

3.2 Proposed algorithm

In this section, the details of the proposed pose-based algorithm are described. This approach inherits some of the ideas of the baseline presented in section 2, including the general pipeline consisting of data preprocessing based on optical flow, neural feature extraction, and further processing and classification. However, each of these steps evolves and undergoes changes. The pipeline of the developed model is presented in Figure 3.3. One can see that although the optical flow maps remain the input data for the neural feature extractor, the structure of this input has changed, which entails a change in the post-processing stage as well. Let us discuss the main steps of the approach in the following subsections.

Figure 3.3 — The pipeline of the pose-based gait recognition algorithm

3.2.1 Body part selection and data preprocessing

Similarly to the baseline model, the optical flow between consecutive video frames is calculated in advance to be used for feature training and extraction. The OF maps are again stored as images; however, the structure of these images is changed. In the baseline model, two channels of the image correspond to the horizontal and vertical components of the optical flow field, and the third one is set to zero. The modified version of these images contains the same first and second channels, but the third one carries the magnitude of the OF vector. Although this value correlates with the first two, this additional information helps the neural network to learn motion features.

The main innovation of this approach is the detailed consideration of point movements in different areas of the human figure. Instead of the OF blocks, which are given up in this investigation, patches of optical flow from different body parts are fed into the network. The set of these body parts is selected for natural and empirical reasons. As explained in section 3.1, the legs are supposed to be the most informative part of the human figure; thus, two patches around the feet are chosen for recognition. However, these patches cover quite a small part of the figure, which is obviously not enough, so two "middle" areas (the lower and upper body parts) are added to the set, together with one "big" patch corresponding to the full body. The bounding boxes of these body parts are shown in Fig. 3.4. Such a hierarchical body part set allows the motion to be considered over the whole figure as well as in smaller areas, mining more precise information. The idea of considering point motion maps in the neighbourhood of human body parts is novel compared to other gait recognition approaches based on optical flow.

Figure 3.4 — The bounding boxes for human body parts: right and left feet (small blue boxes at the bottom), upper and lower body (green boxes in the middle), and full body (big red box)

In order to construct the required patches, the pose key points have to be found. For this purpose, the pose is estimated in each frame by locating the body joints with the OpenPose [89] algorithm. This method extends the convolutional pose machines approach [100], which is based on a cascade of several predictors refining each other's estimates. At each stage, a neural network with a small convolutional architecture predicts the heatmaps of the joint locations. Such a step-by-step procedure increases the receptive field, so that the whole image is observed along with local features, leading to prediction refinement.
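Returning to the modified OF images described at the start of this subsection, their construction can be sketched as follows; the scaling constants are assumptions, as in the baseline sketch.

```python
import numpy as np

def flow_to_image_3ch(flow, bound=15.0):
    # flow: (H, W, 2) optical flow field.
    comps = (flow + bound) * (255.0 / (2.0 * bound))    # components to [0, 255]
    mag = np.linalg.norm(flow, axis=-1, keepdims=True)  # OF vector magnitude
    mag = mag * (255.0 / (np.sqrt(2.0) * bound))        # magnitude to [0, 255]
    img = np.concatenate([comps, mag], axis=-1)
    return np.clip(img, 0, 255).astype(np.uint8)
```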
After the pose estimation stage, the frames of each video can be filtered according to the person's presence in the frame. Since it is the human motion that is considered, only frames containing the full figure are taken into account; therefore, several frames at the beginning and the end of a video (where the person is not entirely in the frame) are sometimes discarded.

Having found the key points, the patches for network training are chosen in the following way. The leg patches are squares with the key foot points at their centers; a patch should cover the whole foot, including the heel and toes, so the side of the square is empirically chosen as 25% of the person's height. The upper body patch contains all the joints from the head to the hips (including the hands), and the lower body patch contains all the joints from the hips to the feet (excluding the hands). Thus, each pair of consecutive frames produces five patches to be used as network inputs (see the visualizations of these patches in Fig. 3.5). This procedure completes the preprocessing step and the data preparation.

Figure 3.5 — The visualizations of optical flow maps in five areas of the human figure

3.2.2 Neural network pipeline

The fundamental component of the proposed algorithm is the extraction of neural features. In this section, the details of the neural network architectures, the training process, and the feature extraction are provided.

Data augmentation

As the main difficulty of neural network based gait recognition is the lack of training samples, it is necessary to augment the data to decrease overfitting and make the algorithm more stable. The data is augmented using conventional spatial transformations. In the training process, four random integers are uniformly sampled for each of the five considered body parts to construct the bounds of an input patch. The first two numbers are the left and right extensions chosen from the interval [0, 𝑤/3], where 𝑤 is the width of the initial bounding box, while the latter two are the upper and lower extensions chosen from [0, ℎ/3], where ℎ is the height of the box. Two cropping methods are applied with equal probability: with or without zero padding. If the extended box goes beyond the frame borders, the patch can either be supplemented with zero values on the corresponding side, or cropped as is, which leads to patch stretching. The resulting bounded patches are cropped from the OF maps and then resized to 48 × 48.

This augmentation allows us to obtain patches containing body parts undergoing both spatial shifts and zoom. If the sum of the sampled numbers is close to zero, a large image of the body part with very little excess background is obtained (see the left example in Fig. 3.6); otherwise, the image is smaller with more background (the second left example). On the other hand, fixing the sums of the first and the second pairs of bounds produces body parts of the same size but with a changed location inside the patch (the two right examples in Fig. 3.6).

Figure 3.6 — The frame with the human figure bounding box (red) and the border of probable crops (green), and examples of patches that can be cropped from this one frame
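A sketch of this augmentation for one body-part patch follows; the array layout and dtype handling are assumptions.

```python
import cv2
import numpy as np

def augment_patch(of_map, box, out_size=48):
    # box: (x_l, y_u, x_r, y_l) bounding box of one body part.
    xl, yu, xr, yl = box
    w, h = xr - xl, yl - yu
    dl, dr = np.random.randint(0, w // 3 + 1, size=2)  # left/right extension
    du, dd = np.random.randint(0, h // 3 + 1, size=2)  # upper/lower extension
    x0, x1, y0, y1 = xl - dl, xr + dr, yu - du, yl + dd
    H, W = of_map.shape[:2]
    if np.random.rand() < 0.5:
        # zero padding where the extended box leaves the frame
        patch = np.zeros((y1 - y0, x1 - x0, of_map.shape[2]), of_map.dtype)
        sy, sx, ey, ex = max(y0, 0), max(x0, 0), min(y1, H), min(x1, W)
        patch[sy - y0:ey - y0, sx - x0:ex - x0] = of_map[sy:ey, sx:ex]
    else:
        # crop as is (clipped to the frame), which stretches the body part
        patch = of_map[max(y0, 0):min(y1, H), max(x0, 0):min(x1, W)]
    return cv2.resize(patch, (out_size, out_size))
```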
Neural network architectures and training process

When the optical flow patches are prepared, they are used for neural network training. An important feature of the proposed model is that the same network with the same weights is used for each of the body parts: the patches are fed into the network interleaved with each other, and no stream separation is needed. Moreover, when applied to multi-view recognition, one general network is used for all the available views; no preliminary separation is made, and no information about the viewing angles is required during training or inference. Unlike many other methods (e.g. those based on GEI) that construct view-dependent features, the proposed model takes all frames as input regardless of the angle.

Two convolutional architectures are considered and compared for this approach. The first one is the best of the baseline model architectures, the truncated VGG-like network; the details of this architecture can be found on page 36. Applying the same architecture allows an exact comparison of the new pose-based model with the baseline one, showing whether considering body parts really improves the recognition. The staged training process described in section 2.2.2 remains the same, with step-by-step layer widening and extra weight regularization. However, despite data augmentation, the dropout layer, and weight regularization, the quality on the testing set still lags behind the training one, disclosing overfitting. Thus, a second architecture, not used in the previous research, is considered.

Unlike all the networks described earlier, which contain both convolutional and dense layers, a fully convolutional model, namely ResNet [101], is proposed for the task. This architecture contains residual connections which allow information to be transferred from low to high levels, so that the addition of each new block increases the classification accuracy. Besides, the absence of fully connected layers in the ResNet architecture reduces the number of model parameters, which also decreases overfitting. Due to this structure, ResNet has become one of the most successful architectures for the image classification task and, hence, very popular and influential in computer vision in general. Despite all the advantages, ResNet architectures are usually very deep, requiring a large number of layers for performance improvement, and each new block significantly increases the number of parameters and lengthens the training process. Another problem is exploding or vanishing gradients, resulting from the very long paths from the last to the first layer during backpropagation. To avoid these problems, the Wide Residual Network [102], with decreased depth and increased residual block width, is chosen; the resulting reduction in the number of layers made the training process much faster and allowed the network to be optimized with less regularization. The details of the considered Wide ResNet architecture are presented in Table 3.1.

Table 3.1 — Wide Residual Network architecture

B1: 3×3, 16; BatchNorm
B2: [BatchNorm; 3×3, 64; BatchNorm; 3×3, 64] × 3
B3: [BatchNorm; 3×3, 128; BatchNorm; 3×3, 128] × 3
B4: [BatchNorm; 3×3, 256; BatchNorm; 3×3, 256] × 3

Each column in the table defines a group of residual blocks with the same number of filters. Similarly to the VGG architectures, all the convolutions have kernels of size 3 × 3. The first layer has 16 filters and is followed by batch normalization; then come three residual blocks, each comprising two convolutional layers with 64 filters with normalization between them.
This block construction is classical for Wide ResNet architectures, and all the blocks have a similar structure: in each successive group the blocks have twice as many filters as in the previous one. The first convolutional layer in each of the groups B3–B4 has a stride of 2; therefore, the tensor size starts at 48 × 48 pixels and decreases to 24 × 24 and then to 12 × 12 in groups B3 and B4, respectively.

Figure 3.7 — Construction of residual blocks in the Wide ResNet architecture

The detailed construction of the residual blocks is shown in Fig. 3.7. The scheme on the left corresponds to a common block without strides, where the input and output shapes coincide; the one on the right defines the structure of the first block of a group (columns B3 and B4), where the number of filters doubles and the map size decreases. In this case, one auxiliary convolutional layer with a 1 × 1 kernel, a stride, and a doubled number of filters is added to make all the shapes equal before the summation. All the batch normalization layers are followed by ReLU activations.

In order to be classified, after passing through all the residual blocks, the tensor has to be transformed into a one-dimensional vector with the number of components equal to the number of subjects, representing the probability distribution over the subjects. For this purpose, a BatchNorm layer is added, followed by average pooling, which "flattens" the stack of 12 × 12 maps produced by the convolutions into one vector. After that, the last dense layer with a softmax non-linearity is applied.

Both networks were trained using Nesterov Momentum gradient descent with the learning rate reduced from 0,1 by a factor of 10 each time the training quality stopped increasing. Network training was implemented in the Lasagne framework [103] with a Theano backend on an NVIDIA GTX 1070 GPU.
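A sketch of one residual block in PyTorch is given below. The exact ordering of normalization, activation, and convolution inside the block is reproduced from Figure 3.7 only approximately and should be treated as an assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # One block of the Wide ResNet in Table 3.1: two 3x3 convolutions with
    # batch normalization; the first block of groups B3-B4 halves the map
    # size (stride 2) and doubles the filters, so a 1x1 strided convolution
    # is added on the shortcut to equalize the shapes before summation.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.conv1(self.act(self.bn1(x)))
        out = self.conv2(self.act(self.bn2(out)))
        return out + self.shortcut(x)
```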
3.2.3 Feature aggregation and classification

The network is trained using the augmented data and then used as a feature extractor, with the outputs of the last hidden layer (AvgPooling) used as gait descriptors. Although the model is trained to determine the subject identity from a patch of optical flow in any of the considered body areas, in the testing phase the features are extracted for each frame and each body part and aggregated into one pose-based gait descriptor. Let us explain how this final descriptor is obtained.

First of all, instead of sampling random bounds for each patch, their mean values are taken (𝑤/6 and ℎ/6, respectively), so that the body part is located at the center of the patch. Then, considering 𝑗 body parts, one obtains 𝑗 patches for each pair of consecutive frames and therefore (𝑁 − 1)𝑗 descriptors for a video, where 𝑁 is the number of frames in the sequence.

Two methods for constructing one feature vector from all the network outputs are investigated. The first, a "naive" method, averages all the descriptors by calculating the mean feature vector over all frames and all body parts. This approach is naive in that the descriptors corresponding to different body parts have different natures; even though they are computed by one network with the same weights, averaging the vectors would be expected to mix the components into a disordered result. Surprisingly, the accuracy achieved by this approach is very high, as will be shown through comparison with the other approach in the next section. The second, more self-consistent approach averages the descriptors only over time. After doing so, 𝑗 mean descriptors corresponding to the 𝑗 body parts are obtained and concatenated to produce one final high-dimensional feature vector.

The obtained gait descriptors are used for further classification. Similarly to the baseline model, the descriptors corresponding to the same person are assumed to be spatially close to each other; thus, the Nearest Neighbour classification method is applied. Prior to fitting and classification, the descriptors are normalized according to their Euclidean length and projected into a low-dimensional subspace by the PCA algorithm, and for each testing object the closest gallery descriptor is searched for. As in section 2, both the $L_1$ and $L_2$ metrics were considered, and the first one produced higher recognition quality. The details and the results of the experiments evaluating the described approach are presented in the next section.
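Both aggregation variants reduce to the following sketch, assuming the descriptors are stacked into a NumPy array:

```python
import numpy as np

def aggregate(feats, mode="concat"):
    # feats: ((N - 1), j, d) array with one descriptor per OF map and body part.
    means = feats.mean(axis=0)    # average over time: (j, d)
    if mode == "avg":             # "naive": also average over body parts
        return means.mean(axis=0)
    return means.reshape(-1)      # concatenate the j part descriptors
```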
3.3 Experimental evaluation

3.3.1 Performance evaluation

The general protocol of the experiments coincides with the one described in section 2.3. Experiments were conducted on all three available databases: TUM-GAID [7], CASIA [9], and OULP [8]. The first one and the 90° subset of the second are used for side-view recognition, while the full CASIA set and OULP are employed in the multi-view experiments. All the data of each collection is separated into two parts: one is used for training the feature extractor, and the other remains for evaluation. In addition to the Rank-1 and Rank-5 metrics defined in the first chapter, the cumulative match characteristic (CMC) curve was plotted to compare the different techniques more clearly.

In addition to the identification experiments, the verification quality of the different methods was compared. All the videos of the testing subjects were divided into training and testing parts in the same way as in the identification task. For each pair of training and testing videos, the algorithm returns the distance between the corresponding feature vectors and, based on this distance, predicts whether the same person appears in both video sequences. To evaluate the verification quality, we plotted the receiver operating characteristic (ROC) curves of false acceptance rates (FAR) and false rejection rates (FRR) and calculated the equal error rates (EERs).

3.3.2 Experiments and results

The goal of the following experiments is to explore the influence of different conditions on gait recognition performance, including:
– the network architecture and aggregation methods;
– the body parts used for training and testing the algorithms;
– the length of the captured walk;
– differences between the training and testing datasets.

The results are compared with the methods proposed during the few years preceding this approach [47; 90; 105; 106], including the Fisher-vector-based model [49] and multimodal features (RGB frames, audio, and depth) [104], which had demonstrated state-of-the-art results in their time.

The first set of experiments aims at evaluating the method itself and comparing different technical choices, such as the network architectures, aggregation methods, and similarity measures. To start with, the side-view recognition experiments were conducted on the TUM and CASIA collections. All the algorithms were trained from scratch. The results of side-view recognition on the TUM-GAID dataset are shown in Table 3.2, which makes it clear that the proposed approach is the most accurate, outperforming all the state-of-the-art methods.

Table 3.2 — Recognition quality on TUM-GAID dataset

Architecture & Embedding | Aggregation | Metric | Rank-1 [%] | Rank-5 [%]
VGG + PCA (1000)         | avg         | L2     | 96,4       | 100,0
VGG + PCA (1000)         | avg         | L1     | 97,8       | 100,0
VGG + PCA (500)          | concat      | L2     | 97,4       | 99,9
VGG + PCA (500)          | concat      | L1     | 98,8       | 100,0
Wide ResNet + PCA (230)  | avg         | L2     | 98,3       | 99,9
Wide ResNet + PCA (230)  | avg         | L1     | 99,2       | 99,9
Wide ResNet + PCA (150)  | concat      | L2     | 98,8       | 99,8
Wide ResNet + PCA (150)  | concat      | L1     | 99,8       | 99,9
Baseline model, VGG      | —           | L1     | 97,5       | 99,9
CNN + SVM [90]           | —           | L2     | 98,0       | 99,6
DMT [105]                | —           | —      | 92,0       | —
RSM [106]                | —           | —      | 84,7       | —
SVIM [47]                | —           | —      | 98,9       | —
DCS [104]                | —           | —      | 99,2       | —
H2M [104]                | —           | —      | 99,2       | —
PFM [49]                 | —           | —      | 99,2       | 99,5

In Table 3.2, "avg" denotes the naive approach to feature aggregation, in which the mean descriptor is computed over all body parts, while "concat" denotes the concatenation of descriptors, which produces better results. It is worth noting that the PFM approach [49], the best of its time, takes input frames at their initial size of 640 × 480 without resolution change, while the inputs of the proposed method are downscaled, requiring less time and memory for processing. Nevertheless, the quality of the algorithm not only reached the level of PFM but even surpassed it. The Wide ResNet architecture shows the best recognition accuracy while having far fewer parameters than the VGG-like one; this architecture is therefore considered more appropriate for the gait recognition task and is taken as the main one in the further experiments.

Since appearance and shape should not affect the gait representation, the recognition quality under the different conditions presented in the dataset is evaluated as well. The results reflected in Table 3.3 confirm that the proposed approach is stable to variations of shoes and carrying conditions, and the model accuracy remains quite high under small changes of silhouette and appearance.

Table 3.3 — Recognition quality on TUM-GAID dataset under different conditions

Method              | Normal  | Backpack | Shoes  | Avg
Pose-based model    | 100,0%  | 100,0%   | 99,4%  | 99,8%
Baseline model, VGG | 99,7%   | 96,5%    | 96,5%  | 97,5%
CNN + SVM [90]      | 99,7%   | 97,1%    | 97,1%  | 98,0%
DCS [104]           | 99,7%   | 99,0%    | 99,0%  | 99,2%
H2M [104]           | 100,0%  | 99,4%    | 98,1%  | 99,2%

The side-view recognition investigations continue with the CASIA 90° view experiments. To confirm the superiority of the pose-based method over the block-based one, the two methods are compared, with the results shown in Table 3.4. Despite the small size of the training set, insufficient for constructing a stable VGG-like baseline model, the novel approach with the lightweight Wide ResNet backbone achieves quite high recognition accuracy on the testing part of the collection and shows less overfitting.

Table 3.4 — Recognition accuracy on CASIA lateral part

Architecture & Embedding | Aggregation | Metric | Rank-1 [%]
Wide ResNet + PCA (150)  | avg         | L2     | 85,1
Wide ResNet + PCA (130)  | avg         | L1     | 86,7
Wide ResNet + PCA (170)  | concat      | L2     | 84,8
Wide ResNet + PCA (170)  | concat      | L1     | 93,0
Baseline model, VGG      | —           | L1     | 74,9

Similarly to Table 3.3, Table 3.5 shows how the different clothing and carrying conditions presented in the CASIA database influence the recognition quality. The data variability in this collection is more complex than in TUM-GAID, as a coat changes the silhouette and hides the joints more than alternative shoes do, which makes the decomposition of the results on this dataset by condition more informative. One can see that while the presence of a bag does not worsen the recognition, putting on a coat actually does. This situation is predictable, as a coat is not just another clothing style.
Unlike conventional indoor clothes, a coat is outerwear that hides the human figure, changes its shape, and makes the motion of many points of the body invisible, which leads to a change of the inner gait representation. Such a dress-up is thus a really challenging factor complicating gait recognition. However, the accuracy decrease is not dramatic, and even the in-coat part-based recognition outperforms the average baseline results.

Table 3.5 — Recognition quality on CASIA lateral part under different conditions

Method              | Normal  | Bag     | Coat   | Avg
Wide ResNet, L1     | 100,0%  | 100,0%  | 78,8%  | 93,0%
Baseline model, VGG | 94,5%   | 78,6%   | 51,6%  | 74,9%

To investigate the general applicability of the algorithm, multi-view gait recognizability has to be verified. For this purpose, one more experiment was conducted on the CASIA database. Unlike all the previous experiments, this one covered not only the lateral-view part of the collection but the full dataset, including 11 different viewing angles. Several multi-view testing protocols have been suggested for this dataset; the one proposed in [60] was chosen here. Following this protocol, the network is trained on all the available data for the first 24 subjects (10 videos per subject shot at all 11 angles), and the evaluation is made on the remaining 100 subjects. Table 3.6 shows the average cross-view recognition rates at three probe angles: 54, 90, and 126 degrees. The gallery angles are the remaining 10 angles (excluding the corresponding probe one). We compare our results with [60], which showed the state-of-the-art result on CASIA, and with SPAE [64], where one uniform model is trained for gait data with different viewing and carrying condition variations. For two of the three considered angles, the pose-based method achieves the highest accuracy, and it also outperforms the other methods on average.

Table 3.6 — Average recognition rates for three angles on CASIA database

                                    | Average Rank-1 [%], probe view
Model                               | 54°   | 90°   | 126°  | Mean
Wide ResNet + PCA (230), concat, L1 | 77,8  | 68,8  | 74,7  | 73,8
SPAE [64]                           | 63,3  | 62,1  | 66,3  | 63,9
Wu [60]                             | 77,8  | 64,9  | 76,1  | 72,9

To complete the evaluation of multi-view recognizability, experiments on the OULP collection [8] were conducted. In addition to verifying the applicability of the proposed approach to multi-view data, these experiments aim at investigating the possibility of recognition in the case of partial information. Since OULP is distributed in the form of binary silhouette masks, only the movements of the external edges are computable, and it is worth exploring whether such information is sufficient for human identification.

Two testing protocols are proposed for this gait dataset. The first one uses a subset of 1912 subjects and 5 different splits of this subset in half, provided by the authors. The final recognition metric is obtained by averaging the results over these splits. The quality is evaluated for each pair of probe and gallery angles but, for brevity, 85° is usually presented as the gallery view, while the probe views comprise all the other angles. The second protocol is based on five-fold cross-validation (cv) on all the available 3,844 subjects shot by two cameras. In this protocol, the results are usually aggregated over the angular difference between the probe and gallery views. In both cases, the final classifier is fitted on the gallery frames from the first walk and tested on the probe frames from the second walk. Since different researchers conduct their experiments following different protocols, both of them were used here to make the investigation thorough.
First, the proposed method was compared with several successful deep and non-deep approaches: the view transformation models wQVTM [43] and TCM+ [44], the methods LDA [38] and MvDA [107] based on discriminant analysis, Bayesian approaches [36; 56], and GEINet [59]. To make the comparison complete, the quality of all the methods was evaluated not only for the conventional classification task, but also for verification. Table 3.7 presents the Rank-1 and EER metrics, while the graphical comparison is shown in Fig. 3.8. The curves for all the methods except the proposed one were provided by their authors.

Table 3.7 — Comparison of Rank-1 and EER on OULP dataset

                                       | Rank-1 [%], probe view | EER [%], probe view
Model                                  | 55°   | 65°   | 75°    | 55°  | 65°  | 75°
Wide ResNet, L1                        | 92,1  | 96,5  | 97,8   | 0,9  | 0,8  | 0,8
Wide ResNet, L2, extended training set | 95,9  | 97,8  | 98,5   | 0,6  | 0,5  | 0,5
GEINet [59]                            | 81,4  | 91,2  | 94,6   | 2,4  | 1,6  | 1,2
LDA [38]                               | 87,5  | 96,2  | 97,5   | 6,1  | 3,1  | 2,3
MvDA [107]                             | 88,0  | 96,0  | 97,0   | 6,1  | 4,6  | 4,0
wQVTM [43]                             | 51,1  | 68,5  | 79,0   | 6,5  | 4,9  | 3,7
TCM+ [44]                              | 53,7  | 73,0  | 79,4   | 5,5  | 4,4  | 3,7
DeepGait + Joint Bayesian [56]         | 89,3  | 96,4  | 98,3   | 1,6  | 0,9  | 0,9
Joint Bayesian [36]                    | 94,9  | 97,6  | 98,6   | 2,2  | 1,6  | 1,3

The second row of Table 3.7 shows the quality of the model trained on all the available subjects except the 956 testing ones. Such an extended training set, containing 2,888 subjects, allowed the recognition to be improved and the state-of-the-art method to be outperformed for some view angles. Although very high, the results achieved with this extended set are not highlighted, as they were obtained under a different training protocol. While the recognition accuracy of the proposed method is slightly lower than the best one [36], the verification quality turns out to be higher: the optical-flow-based method has the lowest EER, and the corresponding ROC curve lies below all the others.

Figure 3.8 — ROC (the first row) and CMC (the second row) curves under the 85° gallery view and the 55°, 65°, and 75° probe views, respectively

To compare the algorithm with two more state-of-the-art cross-view recognition methods [60; 61], the second evaluation protocol was used. The accuracy comparison with these two approaches is presented in Table 3.8. The cross-validation results confirm the ability of the optical-flow-based method to recognize people at different viewing angles. The accuracy of the algorithm is slightly lower than that of [61] but outperforms [60]. Besides, as mentioned above, unlike these methods, the proposed one does not require information about whether the frames were shot at the same or different viewing angles. Thus, the experimental results reveal that the algorithm generalizes to multi-view settings and achieves rather high accuracy even in the case of partial information.

Table 3.8 — Comparison of Rank-1 on OULP dataset following the five-fold cv protocol

                | Rank-1 [%], angular difference
Model           | 0°    | 10°   | 20°   | 30°   | Mean
Wide ResNet, L1 | 98,4  | 98,2  | 97,1  | 94,1  | 97,0
Takemura [61]   | 99,2  | 99,2  | 98,6  | 97,0  | 98,8
Wu [60]         | 98,9  | 95,5  | 92,4  | 85,3  | 94,3
LDA [38]        | 97,8  | 97,1  | 93,4  | 82,9  | 94,6

These results show that even the absence of any texture inside the human figure keeps recognition possible, and they complete the first set of investigations.

The second part of the experimental process concerned the human figure, namely, the body parts that are most important for gait recognition. As mentioned in section 3.1, different parts of the body are supposed to affect the gait representation differently, and the lower part seems more important for recognition.
Nevertheless, the results of the OULP dataset recognition revealed that full-body features are quite informative; therefore, different sets of body parts were compared. In addition to the full set of five parts introduced in section 3.2.1, each separate body part was considered (full body, the pair of leg patches, upper body, and lower body), as well as a set of "big and middle" parts: full body, upper body, and lower body. The latter set was chosen in order to check whether the precise consideration of the small leg areas is really necessary, or whether the big patches are enough. For each of the sets, a new model was trained on the corresponding body parts and then evaluated using concatenation as the pose aggregation method and the $L_1$ metric as the distance measure. The comparison of these models on the TUM dataset is presented in Table 3.9.

Table 3.9 — Recognition quality of models based on different sets of body parts on TUM-GAID dataset

Body parts                              | Rank-1 [%] | Rank-5 [%]
legs                                    | 79,7       | 86,5
upper body                              | 96,2       | 99,7
lower body                              | 96,3       | 99,6
full body                               | 98,9       | 100,0
full body, upper body, lower body       | 99,4       | 100,0
full body, upper body, lower body, legs | 99,8       | 99,9

According to the top rows of the table, the larger the considered area is, the higher the obtained accuracy. The full-body model is the most successful among the "one-part" models, confirming that the corresponding patch is, indeed, the most important one. However, this model is not perfect and makes a mistake in more than 1% of cases. Even the composition of the three "big and middle" patches (full body, upper body, and lower body) does not fully close this gap, despite outperforming each of the three separate models. The best result is obtained by the five-part model, leading to the conclusion that the precise consideration of the leg surroundings is not excessive and improves the classification.

The third avenue of interest was determining the length of video needed for good gait recognition. In all the preceding experiments, the entire video sequence was used, totaling up to 130 frames. Tables 3.10 and 3.11 list the results of testing the algorithms on shortened sections of the TUM-GAID and CASIA sequences (following both the multi-view and side-view protocols). This experiment was not conducted on the OULP dataset, because the period of constant view in its videos is about 30 frames, which is already quite short, so further shortening was considered pointless. Since the network is trained on separate maps without any aggregation, it does not depend on the length of the sequences, and the same model can be used in this experiment. The only change is made during testing: instead of the entire sequence, only its first 𝑛 frames are used for feature calculation. We consider 𝑛 = 50, 60, 70 for the TUM-GAID dataset and 𝑛 = 50, 70, 90, 110 for the CASIA-B database, as all its sequences are longer. Because all the frames are preliminarily filtered to contain a fully visible human figure and the considered sections are still longer than one gait cycle, the starting and finishing positions are not important and do not influence the recognition.

Table 3.10 — Recognition quality for different video lengths on TUM-GAID dataset

Length of video | Rank-1 [%] | Rank-5 [%]
50 frames       | 94,3       | 94,9
60 frames       | 97,5       | 97,8
70 frames       | 99,4       | 99,5
full length     | 99,8       | 99,9

Although the length of the gait cycle is about 1 s, or 30–35 frames, such short sequences turn out to be insufficient for good individual recognition.
The experiments conducted on the two datasets verify that increasing the number of consecutive frames used for classification improves the results. The expanded time of analysis is required because body point motions are similar but not identical for each step; correspondingly, using long sequences makes recognition more stable to small inter-step changes in walking style. It is worth noting that video sequences may consist of a non-integer number of gait cycles and still be accurately recognized.

Table 3.11 — Recognition quality for different video lengths on CASIA-B dataset

             | Cross-view Rank-1 [%], probe view | Side-view Rank-1 [%]
Video length | 54°    90°    126°   Mean         | 90°
50 frames    | 66,85  60,95  64,05  63,95        | 90,33
70 frames    | 67,05  62,70  66,50  65,42        | 91,38
90 frames    | 67,10  66,95  69,95  68,00        | 92,42
110 frames   | 75,25  68,45  73,20  72,30        | 92,68
full length  | 77,80  68,80  74,70  73,77        | 92,95

Similarly to the baseline model, in a final set of experiments the stability and transferability of the proposed algorithm were examined. To check whether the model generalizes, it was trained on one dataset and evaluated on the other one without any fine-tuning. Since the OULP collection differs from the two coloured datasets, the algorithms are obviously not transferable between them, so this database was not used in this experiment. Besides, only side-view recognition could be investigated, because there are no other angles in the TUM database. Nevertheless, even such a restricted model transferability was considered, and the results of the cross-dataset evaluation are shown in Table 3.12.

Table 3.12 — The quality of transfer learning

             | Testing Set
Training Set | CASIA   | TUM
CASIA        | 92,7%   | 66,5%
TUM          | 76,8%   | 99,8%

The recognition quality deteriorated following the dataset change, but not dramatically. Note that the accuracy of the transferred pose-based algorithm on the CASIA database is higher than the accuracy of the directly applied baseline algorithm (compare with Table 2.5). Nevertheless, because of the change of the training dataset, the recognition accuracy decreases by 17% for the CASIA dataset and by about 33% for TUM. This suggests that the algorithm overfits and that the amount of training data (particularly in the CASIA dataset) is not sufficient for constructing a general algorithm, despite the smaller size of the network.

3.4 Conclusions

The conducted research and the results of the experiments lead to the following conclusions. First of all, the high recognition quality of the model based on distinct full-body patches reveals that optical flow itself contains enough information for recognition by motion, and long temporal blocks are excessive. This makes the model much lighter and recognition faster. Secondly, the most important information for gait recognition is the movement of the external edges, reflected in the optical flow applied to silhouette sequences. Thus, the proposed method can be applied even in the case of partial information. Nevertheless, collecting additional information from the regions around the joints improves the results, surpassing the state-of-the-art approaches.

4. View resistant gait recognition

According to previous research, single-view recognition turns out to be much easier than cross-view recognition. The quality of identification is higher when the viewing angles are close to each other, which leads to the need to modify the models and make them more stable to view changes. The current chapter is devoted to methods of increasing view resistance; two approaches are proposed for this purpose, analyzed, and evaluated experimentally.
4.1 Motivation

View variation is one of the main difficulties in the gait recognition problem. This factor changes only the inner gait representation: it is obvious to the human eye that the same person is captured under different angles (see Fig. 4.1); nevertheless, the computer gets two absolutely different video sequences, and it is a complex challenge to build a model that understands that these sequences correspond to the same person. Even when all the other conditions remain the same (the same background and environment, lighting, camera settings, clothes, shoes, and carried objects), the change of viewing point leads to different feature generation, and the greater the difference between the views, the further the gait descriptors are from each other, making the classification more complex.

Figure 4.1 — The example of the same walk captured under different viewing angles

Considering the Nearest Neighbour method as the final classifier of gait descriptors, one needs to make the descriptors of the same subject as close to each other as possible, regardless of the viewing angle the video had been captured under. At the same time, the vectors corresponding to different subjects should be far from each other, so that spatially close vectors have the same labels. Despite the mismatch of formal definitions, such a problem formulation is quite close to clustering. Indeed, many techniques (e.g. Linear Discriminant Analysis [38]) aim at making the features of the same subjects closer while pushing the features of different subjects further from each other.

Figure 4.2 — The desired cluster structure of the data

According to this reasoning, the features trained by the neural network should be given a cluster structure as shown in Figure 4.2. In this chapter, two techniques are proposed for this purpose. The first one is the View Loss: it affects the structure of the descriptors during neural network training by penalizing view dependence. The second one is the cross-view triplet probability embedding (CV TPE), inspired by the TPE [98] approach introduced in section 2.2.3; it modifies the features after extraction from the network and decreases the view dependency as well. Although both methods are model-free and can be applied to any algorithm, here they are integrated into the developed pose-based gait recognition method. The pipeline of the algorithm with these two modifications is shown in Figure 4.3. All the stages of the algorithm except the two novelties remain the same as described in chapter 3. The details of both approaches are given in the following sections.

Figure 4.3 — The pipeline of gait recognition algorithm with two modifications (view loss and CV TPE) shown in green boxes

4.2 View Loss

To overcome the problem of multi-view recognition, one needs to decrease the dependence of the gait features on the view. Thus, the main goal is to train the model so that the features of the same person are similar for different views. For this purpose, a penalty is proposed for features of one subject that are too different from each other. This penalty is implemented as a special loss function added during network training. Each of the neural networks introduced in the previous sections had been trained for the classification task, to predict the probability distribution over the set of training subjects.
The conventional classification loss function is cross-entropy, or logarithmic loss (LogLoss), defined as

$$\mathrm{LogLoss}(p, y) = -\sum_{i=1}^{n} y_i \log(p_i),$$

where $p$ is a predicted distribution vector and $y$ is the one-hot encoded ground truth. In addition to this loss, the auxiliary View Loss is proposed to prevent large distances between the features of the same subject obtained under different views. To make the vectors close to each other regardless of the viewing angle, it penalizes the difference between the representations of videos with one certain view and the averaged representation of videos with different views. Thus, instead of the direct optimization

$$\rho(d^{\alpha}, d^{\beta}) \to \min \quad \forall \alpha, \beta,$$

where $d^{\alpha}$, $d^{\beta}$ are descriptors of videos with views $\alpha$ and $\beta$, respectively, and $\rho$ is any distance metric, the following task is considered:

$$\rho\left(\frac{1}{m}\sum_{j=1}^{m} d_j^{\alpha},\; \frac{1}{m}\sum_{j=1}^{m} d_j^{\beta_j}\right) \to \min \quad \forall \alpha, \beta_j, \qquad (4.1)$$

where all the descriptors $d_j^{\alpha}$ have the same view $\alpha$, while the vectors $d_j^{\beta_j}$ may be obtained under different viewing angles $\beta_j$.

In detail, let $i$ be the ID of a subject and $\alpha$ be an angle. Let us consider two data batches: the first one, $\{b_{i,\alpha}\}$, consists of the data for subject $i$ with view $\alpha$, and the second one, $\{b_i\}$, consists of data for the same subject $i$ captured under any angle. Let $d_{i,\alpha}$ be the average of the last hidden layer outputs over the first batch, and $d_i$ be the average hidden output over the second batch. Since the hidden representations are used as the gait features and the nearest one is searched for in the database, we need to bring $d_{i,\alpha}$ and $d_i$ closer. To achieve this, we consider such pairs of batches and the following loss function for them:

$$L_{reg} = \lambda L_{view} + L_{clf}, \qquad (4.2)$$

where $L_{view} = \|d_{i,\alpha} - d_i\|^2$ is the described view loss corresponding to optimization task (4.1), $L_{clf} = \mathrm{LogLoss}(\{b_{i,\alpha}\}) + \mathrm{LogLoss}(\{b_i\})$ is the conventional cross-entropy for classification applied to both batches, and $\lambda$ is the relative weight of the view term in the total loss. The view term added to the conventional loss function can be considered as a regularization preventing view memorizing. It is worth noticing that different network outputs are used in these two loss terms: while the intermediate (hidden) representations are brought together in $L_{view}$, the final outputs reflecting the probability distribution are taken in $L_{clf}$. Such a strategy aims to enhance the feature extractor character of the network.

Since such batches have to be sampled in a special way (the subject ID is fixed inside the batch, and the angles are fixed in some batches as well), the network cannot be trained exclusively on them. So, the conventional optimization steps based on random batches are alternated with these regularizing steps, and the network makes a “view optimization” step once in $k$ steps. The whole algorithm for one epoch is as follows:

for $j$ in range(epoch_size) do
    Sample random batch;
    Make optimization step: $L_{clf} \to \min$;
    if $j \bmod k = 0$ then
        Sample random subject $i$ and random view $\alpha$;
        Sample batches $\{b_{i,\alpha}\}$, $\{b_i\}$;
        Make regularized optimization step: $L_{reg} \to \min$;
    end if
end for

Such regularized optimization steps obviously increase the unregularized loss but prevent falling into a local minimum. However, the network first has to be trained in the conventional way and reach some “optimal point”. Thus, the following training process has been implemented (a sketch of the regularized step is given below):
1. Unregularized training, until LogLoss stops decreasing;
2. Regularized training with $\lambda = 10^{-3}$, $k = 20$, until both the classification and view losses stop changing;
3. Unregularized training, until LogLoss stops decreasing.
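The regularized step can be illustrated with the following PyTorch-style sketch. It assumes a model that returns both the hidden gait descriptor and the classification logits; all names are illustrative rather than the exact implementation:

import torch
import torch.nn.functional as F

def regularized_step(model, optimizer, batch_fixed, batch_any, lam=1e-3):
    """One 'view optimization' step: L_reg = lam * L_view + L_clf.

    batch_fixed: (inputs, labels) of one subject under one fixed view;
    batch_any:   (inputs, labels) of the same subject under arbitrary views.
    model(x) is assumed to return (hidden_features, class_logits).
    """
    x_fix, y_fix = batch_fixed
    x_any, y_any = batch_any

    h_fix, logits_fix = model(x_fix)
    h_any, logits_any = model(x_any)

    # L_view: squared distance between batch-averaged hidden descriptors
    l_view = (h_fix.mean(dim=0) - h_any.mean(dim=0)).pow(2).sum()

    # L_clf: cross-entropy on the final outputs of both batches
    l_clf = F.cross_entropy(logits_fix, y_fix) + F.cross_entropy(logits_any, y_any)

    loss = lam * l_view + l_clf
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()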
The first step leads to the initial model described in chapter 3. However, when the loss stops decreasing, the addition of the regularizer pulls the solution out of the local minimum if it is not optimal and helps to change the optimization direction and descend to the “real” global minimum. The curve of the logarithmic loss during the training process is plotted in Figure 4.4. As was supposed, the model got into a local minimum after the first step of training, and the second, regularized step pulled it out of this minimum. Despite the increase of LogLoss due to the extra loss term, the third step of optimization leads to the lowest position, which was unreachable without regularization. It is hardly noticeable on the graph, but the loss after the third step is 0,18, which is 13% lower than the 0,21 reached after the first step.

Figure 4.4 — The curve of the training loss during three steps of training process

The decrease of the training loss offers the hope that the quality on the test set improves as well. The confirmation of this assumption is presented in section 4.4.

4.3 Cross-view triplet probability embedding

Although the training process described in the previous section aims to get rid of view dependence, one more approach, based on TPE [98], is proposed to increase view resistance further. Despite the fact that TPE was initially introduced for the verification task, it can successfully be applied to classification, as shown in section 2.2.3. The details of TPE are explained in section 2: it is motivated by inequality (2.2) and reduced to the conventional optimization problem (2.5). However, TPE was introduced without respect to multi-view recognition and does not take angles into consideration. Here, its modification is proposed that uses the information about the view during training and brings the features of the same subject under different angles closer to each other.

Figure 4.5 — The desired mutual arrangement of the elements in a triplet

In the multi-view mode, the features of the same object captured under different views have to be closer to each other than the features of different objects captured under the same angle; thus, condition (2.2) turns into the following one:

$$S_W(v_{a,\alpha}, v_{a,\beta}) > S_W(v_{a,\alpha}, v_{n,\alpha}), \qquad (4.3)$$

where $v_{a,\alpha}$, $v_{a,\beta}$ are the features of the same object $a$ captured under different views ($\alpha \neq \beta$), and $v_{n,\alpha}$ corresponds to another object captured under the same view $\alpha$ as $v_{a,\alpha}$. The visualization of this condition is presented in Figure 4.5. In doing so, the optimization problem (2.5) transforms into a new one:

$$\sum_{v_{a,\alpha}, v_{a,\beta}, v_{n,\alpha}} \log \frac{e^{-\|W v_{a,\alpha} - W v_{a,\beta}\|^2}}{e^{-\|W v_{a,\alpha} - W v_{a,\beta}\|^2} + e^{-\|W v_{a,\alpha} - W v_{n,\alpha}\|^2}} \to \max_W. \qquad (4.4)$$

Similarly to the initial embedding, the parameter matrix $W$ is found by the SGD method. Let us denote

$$u = v_{a,\alpha} - v_{a,\beta}; \quad v = v_{a,\alpha} - v_{n,\alpha}. \qquad (4.5)$$
Since

$$\frac{d}{dW}\|Wx\|^2 = 2Wxx^T,$$

the gradient of the objective obtained for one triplet $v_{a,\alpha}, v_{a,\beta}, v_{n,\alpha}$ equals

$$\nabla_W = -\frac{d}{dW}\left(\|Wu\|^2 + \log\left(e^{-\|Wu\|^2} + e^{-\|Wv\|^2}\right)\right) = -2Wuu^T + \frac{e^{-\|Wu\|^2} \cdot 2Wuu^T + e^{-\|Wv\|^2} \cdot 2Wvv^T}{e^{-\|Wu\|^2} + e^{-\|Wv\|^2}} = \frac{e^{-\|Wv\|^2}}{e^{-\|Wu\|^2} + e^{-\|Wv\|^2}}\left(2Wvv^T - 2Wuu^T\right). \qquad (4.6)$$

Thus, at each step of the gradient ascent, the matrix $W$ is updated according to the following rule:

$$W = W + 2\sum_{v_{a,\alpha}, v_{a,\beta}, v_{n,\alpha}} \frac{e^{-\|Wv\|^2}}{e^{-\|Wu\|^2} + e^{-\|Wv\|^2}} \cdot W\left(vv^T - uu^T\right), \qquad (4.7)$$

where $u$ and $v$ are defined for each triplet $v_{a,\alpha}, v_{a,\beta}, v_{n,\alpha}$ according to equations (4.5). Being embedded into the constructed subspace, the representations become less view-dependent and more identity-dependent: the features of the same subject get closer even if they are obtained under different angles.

To make the training more effective, hard negative mining has been implemented. At each step, the triplet is sampled as follows:

Sample random subject ID $a$ and two random angles $\alpha$, $\beta$;
Sample anchor and positive element $v_{a,\alpha}$, $v_{a,\beta}$;
$d_{min} = \inf$;
for $j$ in range(negatives_count) do
    Sample random subject ID $n$ among all subjects except $a$;
    Sample negative candidate $\hat{v}_{n,\alpha}$ and calculate $d_{a,n} = \|W v_{a,\alpha} - W \hat{v}_{n,\alpha}\|$;
    if $d_{a,n} < d_{min}$ then
        $d_{min} = d_{a,n}$; $v_{n,\alpha} = \hat{v}_{n,\alpha}$;
    end if
end for
return $(v_{a,\alpha}, v_{a,\beta}, v_{n,\alpha}, d_{min})$

where $\inf$ should be any large number and is set to $10^4$, and negatives_count is the number of negative candidates and is set to 200. Thus, the closest negative object in the current subspace is selected, which improves the trainability of the embedding and speeds up the process.
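The whole CV TPE update together with hard negative mining can be illustrated by the following sketch. It assumes that the feature vectors are grouped by (subject, view) pairs and that every subject has features for every view, which is a simplification of the real sampling; all names are illustrative:

import numpy as np

def cvtpe_step(W, feats, lr=0.01, negatives_count=200, rng=np.random):
    """One SGD step of cross-view TPE with hard negative mining.

    feats: dict mapping (subject_id, view) -> list of feature vectors.
    W: current embedding matrix of shape (k, d). Returns the updated W.
    """
    subjects = sorted({s for s, _ in feats})
    views = sorted({v for _, v in feats})

    # Anchor and positive: the same subject under two different views
    a = subjects[rng.randint(len(subjects))]
    alpha, beta = rng.choice(len(views), size=2, replace=False)
    alpha, beta = views[alpha], views[beta]
    v_a = feats[(a, alpha)][rng.randint(len(feats[(a, alpha)]))]
    v_p = feats[(a, beta)][rng.randint(len(feats[(a, beta)]))]

    # Hard negative: the closest other subject under the anchor's view
    others = [s for s in subjects if s != a]
    d_min, v_n = np.inf, None
    for _ in range(negatives_count):
        n = others[rng.randint(len(others))]
        cand = feats[(n, alpha)][rng.randint(len(feats[(n, alpha)]))]
        d = np.linalg.norm(W @ v_a - W @ cand)
        if d < d_min:
            d_min, v_n = d, cand

    # Gradient ascent step on the per-triplet objective (4.4), see (4.7)
    u, v = v_a - v_p, v_a - v_n
    eu = np.exp(-np.linalg.norm(W @ u) ** 2)
    ev = np.exp(-np.linalg.norm(W @ v) ** 2)
    coef = ev / (eu + ev)
    return W + 2 * lr * coef * (W @ (np.outer(v, v) - np.outer(u, u)))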
4.4 Experimental evaluation

To evaluate the influence of the two developed approaches, separate experiments were conducted for each of them, after which the models were combined to check whether View Loss and CV TPE are interchangeable or not. Since this research concerns multi-view recognition, the TUM database is not appropriate for this purpose; thus, only the CASIA dataset was used for the experiments.

In the first experiment, the influence of the view loss was investigated. The network was trained in three stages, as described in section 4.2. After each of the steps, validation was conducted to evaluate the recognition quality on the test set. Following the same protocol as for the part-based model (see section 3.3.2), the aggregated accuracies were calculated for each probe angle. To compare the proposed approach with the state-of-the-art models more thoroughly, the frontal view (0°) recognition results are presented in addition to the 54°, 90°, and 126° angles considered earlier, since this view is one of the most challenging, looking particularly different from the others. The numerical results of the evaluation are shown in the first four rows of Table 4.1. Despite the fact that the LogLoss on the training data increased after the second step, the accuracy on the testing part after this step is higher for most of the angles. During the last step, the loss on the training set decreased to its smallest value, and the final quality on the testing data is also the highest after this step. As noticed in section 2.2.3, feature normalization prior to kNN classifier application is usually very effective, which is confirmed by the third and fourth rows of the table.

Cross-view triplet probabilistic embedding is trained separately from the network, based on any set of feature vectors. The embeddings were trained for two models, the initial pose-based model from chapter 3 and the view-loss-based one, to check whether both approaches are worth applying or one of them is enough for sufficient recognition quality. Besides this, the conventional TPE was trained to evaluate the influence of the “cross-view” component. In all the cases, the embeddings were trained on neural features of the 24 training subjects (the same as for neural network training) and then applied to the test set.

The comparison of all the models with state-of-the-art approaches is also presented in Table 4.1. It demonstrates the superiority of the combination of the developed approaches over their separate usage. One can see that training regularization by the view loss increases the average accuracy by 5,4 percentage points (from 71,6% to 77%), and the additional cross-view embedding increases the quality by another 3 percentage points. The result turns out to be almost 2 percentage points higher than the accuracy of the state-of-the-art GaitSet approach [63]. This confirms that the two proposed concepts complement each other, allowing the model to outperform other existing methods.

Let us notice that there are two different rows corresponding to the baseline part-based model and the view-loss-based model after step 1. Although these two models were obtained by the same procedure, the latter one (as well as all the models described in this chapter) was implemented in the PyTorch neural network framework [108] and achieved higher quality compared to the first model, implemented in Theano.

Table 4.1 — Comparison of average cross-view recognition accuracy of proposed approaches and state-of-the-art models

Method                                     | Average accuracy [%], probe angle
                                           | 0°    54°   90°   126°
View loss, step 1                          | 59,7  80,1  68,9  77,7
View loss, step 2                          | 53,4  79,8  70,3  81,4
View loss, step 3                          | 64,4  81,8  72,6  81,1
View loss, step 3 + normalization          | 65,9  83,6  74,6  83,7
Part-based model + TPE                     | 56,9  82,6  73,5  82,4
Part-based model + CV TPE                  | 62,6  84,6  75,6  84,4
View loss, step 3 + CV TPE + normalization | 69,3  86,3  75,8  86,0
SPAE [64]                                  | –     63,3  62,1  66,3
Wu [60]                                    | 54,8  77,8  64,9  76,1
GaitSet [63]                               | 64,6  86,5  75,5  86,0

4.5 Conclusions

In this chapter, two novel approaches that increase the view stability of gait recognition were presented. Both of these concepts are model-free and can be integrated into any neural network multi-view recognition model: the view loss can be added to any loss function of any neural architecture, and the embedding is applicable to any model prior to feature classification. Applied to the developed pose-based model, each of the modifications improves the cluster structure of the data and thereby increases recognition quality. Moreover, the joint application of the two approaches improves the model even more, allowing it to achieve state-of-the-art quality.

5. Event-based gait recognition

One of the main advantages of gait as an identifier is that it can be captured at a large distance and does not require high resolution or perfect exactness. Particular characteristics of appearance are not as important as the movements of the points. This makes gait recognizable not only in conventional videos with coloured frames changing 25–30 times per second, but also in other data formats reflecting the motion of the points. One such format is the event stream obtained from a Dynamic Vision Sensor (DVS) [109]. These sensors have gained popularity recently and often replace conventional cameras; thus, the recognizability of gait in the event stream is worth investigating.
This chapter focuses on the possibility of applying the developed gait recognition methods to event-based data and is structured as follows. In the first section, the description and important characteristics of the DVS are presented; they are followed by sections devoted to event stream processing and algorithm modifications; and at the end of the chapter, experimental evaluations are presented to verify the recognizability of such data.

5.1 Dynamic Vision Sensors

Firstly, let us describe what a DVS is and why it is increasingly being used instead of conventional cameras. These sensors are inspired by the natural human retina: they similarly capture changes in light intensity on a per-pixel basis. The output of a DVS is a stream of events, each reflecting the fact that the intensity of the point with spatial coordinates (𝑥, 𝑦) has increased or decreased at time 𝑡. Once the magnitude of the intensity change gets beyond some threshold, a positive or negative event is generated. Pixel intensity changes are recorded about 10³ times per second, generating an asynchronous event stream at microsecond time resolution. Visually, this event stream can be presented as a point cloud (with (𝑥, 𝑦, 𝑡) 3D coordinates and the colour corresponding to the sign of the intensity change). Mathematically, an event is a tuple (𝑥, 𝑦, 𝑡, 𝑝), where 𝑥 and 𝑦 are the coordinates of the pixel, accepting integer values in the ranges [0, 𝑊] and [0, 𝐻] (with 𝑊 × 𝐻 being the spatial resolution of the sensor), 𝑡 is a timestamp calculated in microseconds, and 𝑝 ∈ {1, 0} is the polarity of the event corresponding to the increase (𝑝 = 1) or decrease (𝑝 = 0) of the pixel intensity.

If the sensor is static, the events are generated only by moving objects or when the lighting changes. Since natural light usually changes slowly and smoothly, and artificial lighting turns on and off relatively seldom, the majority of events appear due to real motion, and the background points generate events rarely. This makes it possible not to store unnecessary information and reduces the time required for data transfer. Besides this, due to the high temporal resolution, the sensor can capture very fast motion, which looks blurry on frames obtained by conventional cameras. In addition to temporal sensitivity, it is extremely sensitive to intensity and can capture very small changes invisible to digital cameras. The threshold for event generation depends on the exact device, but it is always very low, so the sensor reacts to small brightness changes. Thus, the sensor can get information even in the case of very bad lighting, when a conventional camera outputs almost black frames. One more advantage is very low power consumption relative to RGB sensors: DVS need less energy and hence do not require powerful accumulators, which makes them lighter and easier to use. Due to all these features, dynamic vision sensors have become widespread in many fields. For example, they are currently being developed for home assistance systems (“smart home”), where energy saving is very relevant.

5.2 Data visualization

Since the aim of the research is to recognize a person by the walking manner, we need to know the mutual arrangement of points and their relative changes rather than discrete events at distinct points. Hence, a visualization of the event stream is required to be able to deal with it similarly to conventional video frames. The reconstruction of the initial scene from the stream is a challenging problem, and at the time of the study there were no high-quality reconstruction methods.
Nevertheless, the stream itself contains enough information for many computer vision applications. Thus, it is proposed to make a simple visualization which carries information about the content of the video while being easy to compute. Several ways of transferring the DVS signal into video have been proposed by different researchers. The one that seems the most natural and is often used for event stream visualization (see, e.g., [77]) is selected in this investigation. To construct this visualization, a temporal window of a certain length is considered, and for each point the sum of all its events within this time interval is calculated (see Fig. 5.1). The length of the window is set to 0,04 seconds to get a 25 fps frame rate. The polarities of the events are normalized prior to summation to get ±1 values; thus, negative and positive events that occurred very close to each other (during one time interval) can cancel each other out, preventing the appearance of noise. With this way of aggregation, a sequence of frames is obtained with zero values in the areas where nothing happened, and the more changes occurred at a point, the larger the absolute value of the corresponding pixel. Thus, these frames can be interpreted as a measure of activity (a sketch of this aggregation is given at the end of this section).

Figure 5.1 — The transformation of the event stream into a sequence of frames

However, although the events obtained by the sensor are generated due to changes in the scene and thus correlate with the motion of the points, they do not always follow the motion (because lighting is still one of the reasons for event generation), and even when they do, they do not show where a point has moved. This distinguishes the event stream from the optical flow, which reflects exactly the apparent spatial movements of the points between frames. Thus, the event stream visualizations do not provide information about motion directly, but they are supposed to contain enough data for recognition. It is worth noting that although these constructed images are not real frames and many appearance characteristics get lost, the moving human figure is visually recognizable in them, and this figure is the only moving object in the frame due to the stationarity of the background and the chosen aggregation method. An example of a visualized frame is shown on the left side of Fig. 5.2. Having these event images, one can apply different image-based methods of detection, pose estimation, and gait recognition to them and check the transferability of these methods to event streams.

Figure 5.2 — The examples of event-based image (left) and blob detection result (right)
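A minimal sketch of this aggregation, assuming the event stream is given as coordinate, timestamp, and polarity arrays (the exact storage format of the sensor is abstracted away):

import numpy as np

def events_to_frames(x, y, t, p, width, height, window_us=40_000):
    """Aggregate an event stream into frames by summing signed polarities.

    x, y, t, p: arrays of pixel coordinates, microsecond timestamps, and
    polarities (1 = brighter, 0 = darker). window_us = 0.04 s gives 25 fps.
    """
    sign = np.where(p == 1, 1, -1)               # normalize polarity to ±1
    bins = ((t - t.min()) // window_us).astype(int)
    frames = np.zeros((bins.max() + 1, height, width), dtype=np.int32)
    np.add.at(frames, (bins, y, x), sign)        # opposite events cancel out
    return frames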
5.3 Recognition algorithm

In this section, the details of the event stream recognition algorithm are presented. Although the aggregation transforms the stream into a sequence of images which can be processed similarly to conventional video frames, these images have a different meaning, and the direct application of the developed methods is not always possible. Thus, to transfer the algorithm to such data, some modifications have to be implemented. The whole pipeline of the algorithm is shown in Figure 5.3 and almost replicates the conventional pose-based algorithm. The following subsections focus on the distinct steps of the algorithm that are modified or need some explanation due to the change of source data, namely:
1. human figure detection;
2. pose estimation;
3. optical flow estimation.
The rest of the operations are implemented as explained in section 3.2.

Figure 5.3 — The pipeline of the event-based gait recognition algorithm

5.3.1 Human figure detection

The first thing that should be done to recognize a subject is to detect it; thus, the detection algorithm for event-based data is the topic of the first subsection. Since the sensors are usually applied for video surveillance (for example, in home assistance systems), they are fixed, leading to a static background in the captured data. In this case, the points of the background do not generate events, and most of the events appear because of the moving objects, namely the people that have to be identified. Although the human figure in the event-based images can be noticed visually and even looks quite similar to the figure in a grayscale image, most of the modern complex neural detection methods applied “out-of-the-box” get confused by such inputs and make mistakes. Nevertheless, due to the motionless background, the human figure is the only object leading to event generation, so “semantic” detection is not needed, and the task can be reduced to separating a non-uniform human figure from a background that is almost monophonic. Thereby, a basic blob model was constructed which considers the average relative deviation of pixel brightness in each row and column of the visualized stream from the median value and detects when these deviations exceed some threshold. Due to the lack of noise in the images, the threshold is very low and is set to 0,4–0,45%. The rows and columns with the first and the last spikes of brightness are assumed to be the borders of the human figure bounding box (a sketch of this procedure is given below). Such a heuristic approach is very quick, not requiring any training, and works well when all of the human's body parts move; but if some part of the body stops moving for a moment (as periodically happens to hands or feet), it becomes invisible and the figure may be detected incompletely. So, to be sure that the boxes cover the whole figure, we widen them by 10 percent on each side. An example of detection is shown on the right side of Fig. 5.2. One can see that the legs are not easily separable from the background, so the widening of the box is really needed. However, such approximate bounding boxes are indeed easily computable and sufficient for further human pose estimation. Depending on whether only the full-height figure or several body parts are used in the model, these blob detections can be used as final boxes or as auxiliary data for pose estimation. It will be discussed in the next subsection that in the latter case the influence of detector mistakes on the final recognition decreases.
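A minimal sketch of this blob detection; the exact normalization of the deviations is not specified in the text, so the relative scaling used here is an assumption:

import numpy as np

def blob_bounding_box(frame, threshold=0.004, widen=0.1):
    """Detect the moving-figure bounding box on an aggregated event frame.

    Rows/columns whose mean absolute deviation from the frame median exceeds
    `threshold` (relative to the deviation range) are treated as figure area.
    """
    dev = np.abs(frame.astype(float) - np.median(frame))
    scale = dev.max() if dev.max() > 0 else 1.0
    row_dev = dev.mean(axis=1) / scale           # per-row relative deviation
    col_dev = dev.mean(axis=0) / scale           # per-column relative deviation

    rows = np.flatnonzero(row_dev > threshold)
    cols = np.flatnonzero(col_dev > threshold)
    if rows.size == 0 or cols.size == 0:
        return None                              # nothing moved in this frame
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]

    # Widen the box by 10% on each side to cover temporarily invisible parts
    h, w = bottom - top, right - left
    top = max(0, int(top - widen * h))
    bottom = min(frame.shape[0] - 1, int(bottom + widen * h))
    left = max(0, int(left - widen * w))
    right = min(frame.shape[1] - 1, int(right + widen * w))
    return top, bottom, left, right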
5.3.2 Pose estimation

According to the results of the study described in chapter 3, a more precise consideration of body parts increases identification accuracy compared to full-body classification. Thus, it is assumed that such an approach can improve the recognition quality for event images as well. Similarly to human figure detection, the state-of-the-art methods trained for pose estimation in conventional RGB images do not cope with event-based ones. Hence, the existing models have to be modified to be applicable to the event-based data. The Stacked Hourglass Network [110] was chosen, since it is well-trainable and has an adequate number of parameters, which makes it easily and quickly applicable. The Stacked Hourglass Network is a neural model consisting of several “hourglass” residual modules allowing it to capture information across different scales. This model does not solve a regression problem and approximates not the exact locations of the human joints but the heatmaps of their positions, considering a Gaussian distribution centered at the real joint coordinate as the target (a sketch of such target generation is given below). The chosen architecture, containing two stacked hourglasses, was initially trained on the MPII Human Pose [111] dataset and achieves 86,95% of correct key points (PCKh).

Being very successful, the Stacked Hourglass Network has one disadvantage compared to other state-of-the-art approaches: it processes only one person in a frame and requires an approximate location of this person as input. However, since human figure detection turns out to be quite a simple procedure not requiring much time or computational power, it is used for the preliminary analysis, and the obtained bounding boxes of the figures (their center and side) are inputted into the pose estimation model along with the frames themselves. The main requirement for the boxes is that they should contain the target human figure and not contain any other figures. Since these conditions are usually satisfied, the detector allows estimating the pose accurately enough for the subsequent body part localization.

Nevertheless, even knowing the approximate human body bounding box, this network does not find the correct locations of the joints in event-based images and thus could not be used directly for body part detection. To fine-tune the model, we simulated event-based images for the CASIA [9] dataset and annotated them automatically using the OpenPose library [89]. The OpenPose method is actually more general than the Stacked Hourglass Network, being able to estimate the poses of several people simultaneously: it does not require any bounding boxes but finds all the figures in a frame and returns for each person the coordinates of the joint locations and their confidences. Nevertheless, this algorithm is more complex in training; thus, only its inference procedure was performed to get the automatic annotation, and the simpler Hourglass model was taken for training the event-based pose estimator. The details of the simulation algorithm are described in section 5.4. Because of the relatively small number of subjects and the large number of conditions, the CASIA database is very challenging for gait recognition. Nevertheless, since it contains thousands of frames depicting a full-height human figure, once labeled, this dataset is perfect for the purpose of walking pose estimation. Annotating the CASIA database allowed us to get a large DVS dataset labeled with pose key points that can be used for network training.
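The Gaussian heatmap targets used for such training can be generated as in the following sketch (the value of sigma is illustrative):

import numpy as np

def joint_heatmap(joint_xy, size, sigma=1.0):
    """Gaussian heatmap target for one joint, as used in heatmap regression.

    joint_xy: (x, y) joint coordinate in heatmap pixels; size: (H, W).
    """
    h, w = size
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))      # peak of 1.0 at the joint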
Figure 5.4 — Examples of the skeletons estimated by fine-tuned Stacked Hourglass model

The exact quality of the estimator was not evaluated, as it is just an auxiliary problem, but visually the results are close to the ground truth. Examples of several skeletons estimated for event-based images are shown in Fig. 5.4. The first two images correspond to an almost perfect estimation: all the key points are located correctly. The two other examples reflect the most common mistakes that occur during pose estimation: leg invisibility, leading to a “one-legged” skeleton, and a “noisy thorax”, when specifically this key point turns out to be misdetected.

Similarly to the baseline conventional model, the skeletons are not inputted directly into the recognition model but are used to compute the “areas of interest” in order to consider the optical flow more precisely. Thus, the thorax key point can be thrown away: its natural location is always strictly inside the box containing the head, hands, and hips, so this point should not affect the boundaries of the body part bounding boxes. Such a heuristic helps to get rid of the “noisy thorax” problem. The “one-leg” problem is not as easy to solve as the previous one. First of all, this problem is much more common: being on the ground, the supporting leg does not move, so no events are generated in this area for a while, and the corresponding leg becomes invisible, as shown in the third image of Fig. 5.4. And the legs not only influence the boundaries of the boxes but completely define some of them. For these reasons, a reduced set is considered, containing only three big parts corresponding to the upper and lower body parts and the full body. The full five-element set and the full-body model are considered as well, to repeat the original model and compare the results.

It is worth noting that despite the presence of the human figure bounding boxes obtained from the detector, the full-body boxes in the pose-based model are computed anew according to the joint locations. This allows refining the detection and cutting off the excess background that could be caught because of noise in the event-based images. Fig. 5.5 shows the difference between the initial bounding box and the joint-based one.

Figure 5.5 — Human figure bounding boxes after blob detection (left) and after refinement by pose estimation (right)

5.3.3 Optical flow estimation

Since we transfer the developed OF-based method to event-based sequences, the main source of information is the optical flow in the surroundings of the human figure. Although the visualizations constructed from the event stream are not real reconstructions, they do have a natural meaning. As mentioned in section 5.2, all the events that occurred at each spatial position during some time interval are aggregated in the visualization; thus, these frames can be interpreted as a measure of activity. Besides this, in the absence of overlapping, it can be assumed that nothing moves inside the bounding box except the human figure itself, so it is supposed to be the only moving object in the scene, and during the event aggregation the presence of the person at each pixel is measured. This makes the event frames similar to silhouettes, which also reflect the presence of a human body at each pixel. It was shown in the previous study (see chapter 3) that optical flow computed from binary silhouettes contains enough information for identification; therefore, the OF based on event frames is supposed to be recognizable as well.

As in all the developed methods, the optical flow is computed by the Farnebäck algorithm [87], but unlike in the previous approaches, the third channel is not added to the horizontal and vertical components of the optical flow, and the maps are not stored as images. In order to save all the information about pixel movements without any compression or discretization, the raw optical flow is stored in a binary format and normalized prior to cropping and inputting into the network (a sketch of this computation is given below). Fig. 5.6 shows the horizontal and vertical components of such maps after normalization.

Figure 5.6 — Two components of event-based optical flow maps
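A minimal sketch of this computation using the OpenCV implementation of the Farnebäck algorithm; the parameter values and the final normalization are illustrative assumptions:

import cv2
import numpy as np

def event_optical_flow(prev_frame, next_frame):
    """Dense Farnebäck optical flow between two aggregated event frames.

    Frames are scaled to uint8 grayscale; the raw two-channel flow is kept
    (no third channel, no image quantization) and normalized afterwards.
    """
    def to_u8(f):
        f = f.astype(np.float32)
        rng = f.max() - f.min()
        if rng == 0:
            return np.zeros(f.shape, np.uint8)
        return ((f - f.min()) / rng * 255).astype(np.uint8)

    flow = cv2.calcOpticalFlowFarneback(
        to_u8(prev_frame), to_u8(next_frame), None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.1, flags=0)
    return flow / (np.abs(flow).max() + 1e-8)    # normalize to [-1, 1]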
5.4 Event-based data

To evaluate the proposed algorithm modification, a gait collection captured by a DVS is required. Similarly to classical gait recognition, the lack of data is the most challenging problem. Moreover, unlike the conventional case, where some datasets exist and can be used, there are no publicly available event data appropriate for the gait recognition task; thus, the event stream has to be simulated. Several ways of DVS data generation have been proposed by now, but depending on the purpose of the simulation, some of them are not applicable to the gait recognition task. The most common way is to generate a short event stream by sensor motion around some static scene. However, the gait sequences need to be captured by a static sensor, so this approach is not appropriate. Besides this, since gait is a very special kind of motion, it is hardly generable from scratch; an initialization is required, so it is proposed to simulate streams based on the existing gait collections.

The data obtained by a DVS has two main features: each event represents the fact of a brightness increase or decrease, but not the value of the change, and the events are generated asynchronously with a very high frequency (several thousand times per second). Tracking brightness differences between consecutive frames is not a challenging task, since one just needs to compare the values of every pixel in each pair of frames. The main problem is the frequency of the events, as the frame rate of all the videos in gait datasets is 25–30 fps, which is several orders of magnitude less than the sensor generates. In conventional low-frequency video sequences, each pixel shifts a lot between frames, so the events cannot be correctly generated directly: to get realistic events, the intermediate frames have to be approximated. A simple linear approximation proposed in [112] was selected for our purpose. Firstly, coloured RGB frames are transformed to grayscale according to the formula

𝑌 = 0.299𝑅 + 0.587𝐺 + 0.114𝐵,

and then the intermediate intensity of each pixel is interpolated linearly between consecutive frames. Each time the intensity changes by more than some value 𝐶, a new event is generated in the corresponding pixel (a sketch of this procedure is given below). Despite the simplicity and naivety of this approach, the resulting event stream turns out to be rather close to the real data, and the images constructed by aggregation look quite natural. To estimate the simulation visually, several real recordings were made by a DVS, and the corresponding frame sequences were constructed and compared. Figure 5.7 shows two event-based images: the simulated data (on the right) has somewhat more noise around the human figure, but this noise is hardly noticeable, and all the details of the figure are even clearer than in the real frame (on the left). The visual comparison allows us to assume that the simulation is close enough to a really captured stream and that the experiments can be conducted on such data.

Figure 5.7 — The visualizations of the real (left) and simulated (right) event streams
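A minimal sketch of this simulation; the number of interpolation substeps and the threshold C are illustrative assumptions:

import numpy as np

def simulate_events(frames, fps=25, steps=40, C=0.1):
    """Simulate a DVS event stream from a low-frame-rate grayscale video.

    Pixel intensities are linearly interpolated between consecutive frames
    (`steps` substeps per frame interval); an event (x, y, t, p) is emitted
    whenever the interpolated intensity drifts by more than C from the level
    at which the previous event of that pixel fired.
    """
    frames = [f.astype(np.float32) / 255.0 for f in frames]
    level = frames[0].copy()                     # intensity at the last event
    events = []
    dt_us = 1e6 / (fps * steps)                  # substep duration, microseconds
    for i in range(len(frames) - 1):
        for s in range(1, steps + 1):
            t = (i * steps + s) * dt_us
            cur = frames[i] + (frames[i + 1] - frames[i]) * (s / steps)
            diff = cur - level
            ys, xs = np.where(np.abs(diff) > C)
            for xx, yy in zip(xs, ys):
                events.append((xx, yy, t, int(diff[yy, xx] > 0)))
                level[yy, xx] = cur[yy, xx]      # reset the reference level
    return events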
Two gait datasets were transformed into event streams for training and evaluation of the model. The TUM-GAID database was chosen as the main dataset for the experiments due to its simplicity: videos from this collection embody the perfect situation, when a person is visible from the side and there are no overlappings. Nevertheless, considering that this study was the first of its time to solve the DVS gait recognition problem, such data could estimate the workability and efficiency of the proposed model prior to applying it to more complex real data. The second database, CASIA-B, was used for pose estimator training. Being very challenging for gait recognition due to the small number of subjects, this collection nevertheless contains 110 video sequences for each person and is a perfect set for pose key point detection. CASIA was also transformed into the event-based form, leading to about 1.7M labeled frames applicable for pose estimation. To accelerate the training process and make the data less complex, the dataset was filtered, and only videos captured under viewing angles from 54 to 144 degrees were taken, since at steeper angles some body parts can get out of the frame. Hence, about 770 000 annotated frames were used for fine-tuning the pre-trained model.

5.5 Experimental evaluation

The experiments conducted in this study aim to check whether a person can be recognized by gait in the event stream and how accurate such recognition is. In the experiments, the detection methods and the influence of different body parts on the recognition were compared. The evaluation protocol and metrics replicate the ones described in chapter 3, in order to be able to compare the RGB and DVS models.

Firstly, the following basic model was examined. The classification network was trained to predict the person by the OF patches containing the full-height figure. The bounding boxes to be cropped were found by the basic blob detector, and the pose estimation model was not used at all. The pipeline of this model is presented at the top of Fig. 5.3 (yellow and green blocks). The accuracy of this model is shown in the first row of Table 5.1. Since this study is the first to recognize gait from event-based data, there were no state-of-the-art methods to compare the results with. However, regardless of any other approaches, the recognizability of the data itself can be investigated and a comparison with conventional video can be made. For this reason, the quality of the original method is presented in the last row of the table. The accuracy of the event-based model is about 2% less than the conventional one, which demonstrates that tracking the events of brightness change instead of observing the total scene loses almost no information about the motion and, thus, does not deteriorate the recognition. A probable reason for this result is the lack of noise due to the dynamic nature of the data: the sensor reacts only to the changes that exceed some threshold, ignoring too small spikes, and such “data cleansing” counterbalances the loss of effective information caused by considering events instead of real pixel brightness values.

After the basic model was constructed, it was modified by adding the pose estimation prior to OF patch cropping. This advanced model is shown in the yellow and blue blocks of Fig. 5.3. Due to the “one-leg” problem and the not obvious usefulness of the optical flow around the legs, two sets of body parts were considered: all five parts and three big parts of the body. The second and the third rows of Table 5.1 represent the comparison of these models with the corresponding approach based on RGB videos. The model based on three body parts achieves higher accuracy than the five-parts-based one; this presumably happened because of the redundant noise in the surroundings of the legs and probable misdetections of invisible legs in many frames. Note that the best accuracy obtained by the three big body parts is only 0,8% less than the original model quality.
Table 5.1 — The quality of the end-to-end model on simulated TUM dataset

Body part set                  | Rank-1 [%] | Rank-5 [%]
Basic full body                | 98,0       | 99,7
Pose-based, three parts        | 99,0       | 100,0
Pose-based, five parts         | 98,8       | 99,8
Original RGB pose-based model  | 99,8       | 100,0

Besides this, the influence of the detection quality on the whole model was evaluated. The simplest moving object detector was used, and its mistakes could worsen the final result; thus, we conducted the same experiments using more accurate bounding boxes computed from the initial RGB videos by a background subtraction procedure. All the other steps of the algorithm remained the same. Due to the static background, the result of the subtraction is the silhouette mask of the moving person, which can easily be bounded, so such bounding boxes are supposed to be really close to the ground truth. Using these boxes, the effectiveness of the optical flow itself is evaluated without the influence of misdetections. The results of such “pure” evaluation of the basic model and the best pose-based one are presented in Table 5.2.

Table 5.2 — The quality of the perfectly detected model on simulated TUM dataset

Body part set                  | Rank-1 [%] | Rank-5 [%]
Basic full body                | 98,3       | 99,6
Pose-based, three parts        | 99,1       | 99,8
Original RGB pose-based model  | 99,8       | 100,0

Although the accuracy of both models increased a bit, the results remain quite close to the end-to-end model. This shows that probable detector mistakes do not decrease the quality greatly.

In addition to the traditional classification problem, to complete the investigation, the verification task was considered. For each pair of probe and gallery videos, the classification problem is stated: whether they belong to the same person or not. The Euclidean distance between the neural descriptors is taken as the metric, and classifiers were constructed for all possible thresholds in order to obtain a ROC curve and calculate the ROC AUC and EER (see Fig. 5.8 and Table 5.3; a sketch of this evaluation is given below). The difference between the curves is hardly noticeable; thus, an auxiliary large-scale figure is added to show that the curve corresponding to the three-part-based DVS model is closer to the upper-left corner of the plot than all the others. The numerical results provided in Table 5.3 also show that the event-based model surpasses the original RGB-based one in the verification task: although the difference is quite small, both AUC and EER turn out to be better for the event-based model than for the conventional one.

Figure 5.8 — The ROC curves corresponding to three considered models and the original RGB model: fully depicted (left) and large-scale (right) to show the difference

Table 5.3 — Results of verification on TUM-GAID dataset

Body part set                  | ROC AUC | EER [%]
Basic full body                | 0,9985  | 1,89
Pose-based, three parts        | 0,9994  | 1,10
Pose-based, five parts         | 0,9992  | 1,34
Original RGB pose-based model  | 0,9975  | 1,26
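A minimal sketch of this verification evaluation, assuming descriptor matrices and subject ID arrays for the probe and gallery sets (names are illustrative):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def verification_metrics(probe_feats, gallery_feats, probe_ids, gallery_ids):
    """ROC AUC and EER for the verification task on neural gait descriptors.

    Every probe-gallery pair is scored by the negated Euclidean distance
    (higher score = more likely the same person).
    """
    dist = np.linalg.norm(
        probe_feats[:, None, :] - gallery_feats[None, :, :], axis=2)
    labels = (probe_ids[:, None] == gallery_ids[None, :]).ravel()
    scores = -dist.ravel()

    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    eer = fpr[np.argmin(np.abs(fpr - fnr))]      # point where FAR ≈ FRR
    return auc, eer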
5.6 Conclusions

The research described in this chapter has verified the recognizability of the event streams obtained by dynamic vision sensors. Despite the very simple and rough way of data visualization, the event-based frames contain enough information for human recognition. Since the gait recognition algorithm developed in this work, and particularly in chapter 3, considers the motion of the points rather than the appearance or any colour characteristics, it can easily be transferred to event data containing information about the dynamics of the scene. Besides this, while solving the gait recognition problem, the object detection and pose estimation problems were considered. The visualization method selected in this study makes the event flow visually similar to grayscale images and semantically close to silhouette masks, which makes the application of conventional methods natural. The results of the conducted experiments show that, when applied to videos containing a single moving person, modern computer vision methods achieve very high quality, close to the coloured-video-based models and even superior to them in the verification task. Hence, the event-based data is not only presumably similar to the conventional one: this closeness is confirmed experimentally. Such results encourage applying conventional methods to DVS data and developing them further.

Conclusion

In this thesis, a novel gait recognition approach is presented. Due to the variability of influential factors and the absence of one general database covering all possible conditions, the Author has selected some of the factors to overcome and proposed a method stable to their changes. The main contributions of this thesis are as follows.
1. A side-view gait recognition method which analyses the point translations between consecutive video frames is proposed and implemented. The method shows state-of-the-art quality on the side-view gait recognition benchmark.
2. The first multi-view gait recognition method based on the consideration of the movements of the points in different areas of the human body is proposed and implemented. State-of-the-art recognition accuracy is achieved for certain viewing angles, and the best approaches at the time of the investigation are outperformed in the verification mode.
3. The influence of the point movements in different body parts on the recognition quality is revealed.
4. An original study of gait recognition algorithm transfer between different data collections is made.
5. Two original approaches for view resistance improvement are proposed and implemented. Both methods increase the cross-view recognition quality and complement each other when applied simultaneously. The model obtained using these approaches outperforms the state-of-the-art methods on the multi-view gait recognition benchmark.
6. The first method for human recognition by motion in event-based data from a dynamic vision sensor is proposed and implemented. Quality close to conventional video recognition is obtained.

Further development of the research is possible in the following areas:
1. Development and implementation of view estimation methods based on gait videos;
2. Integration of view information into a recognition model for identification quality improvement and increasing the view resistance;
3. Development and implementation of multi-view RGB and event-based gait video synthesis by applying motion capture methods and generating 3D data to enlarge the existing training and testing datasets;
4. Application of synthetic data for multi-view recognition quality increase in different types of data (RGB, event streams).

References

1. Maslow, A. Motivation and Personality / A. Maslow. –– Oxford, 1954. –– 411 p.
2. Cutting, J. E. Recognizing friends by their walk: Gait perception without familiarity cues / J. E. Cutting, L. T. Kozlowski // Bulletin of the Psychonomic Society. –– 1977. –– Vol. 9, no. 5. –– P. 353––356.
3. Murray, M. P. Gait as a total pattern of movement / M. P. Murray // American Journal of Physical Medicine & Rehabilitation. –– 1967. –– Vol. 46.
4. Lichtsteiner, P. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor / P. Lichtsteiner, C. Posch, T. Delbruck // IEEE Journal of Solid-State Circuits. –– 2008. –– Vol. 43, no. 2. –– P. 566––576.
5. A 640×480 dynamic vision sensor with a 9µm pixel and 300Meps address-event representation / B. Son [et al.] // 2017 IEEE International Solid-State Circuits Conference (ISSCC). –– 02/2017. –– P. 66––67.
6. A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor / C. Brandli [et al.] // IEEE Journal of Solid-State Circuits. –– 2014. –– Oct. –– Vol. 49, no. 10. –– P. 2333––2341.
7. The TUM Gait from Audio, Image and Depth (GAID) database: Multimodal recognition of subjects and traits / M. Hofmann [et al.] // Journal of Visual Communication and Image Representation. –– 2014. –– Vol. 25(1). –– P. 195––206.
8. The OU-ISIR Gait Database Comprising the Large Population Dataset and Performance Evaluation of Gait Recognition / H. Iwama [et al.] // IEEE Trans. on Information Forensics and Security. –– 2012. –– Oct. –– Vol. 7, Issue 5. –– P. 1511––1521.
9. Yu, S. A Framework for Evaluating the Effect of View Angle, Clothing and Carrying Condition on Gait Recognition / S. Yu, D. Tan, T. Tan // Proc. of the 18th ICPR. Vol. 4. –– 2006. –– P. 441––444.
10. Hofmann, M. Gait Recognition in the Presence of Occlusion: A New Dataset and Baseline Algorithms / M. Hofmann, S. Sural, G. Rigoll // 19th International Conference on Computer Graphics, Visualization and Computer Vision (WSCG). –– Plzen, Czech Republic, 01/2011.
11. Efficient Night Gait Recognition Based on Template Matching / Daoliang Tan [et al.] // 18th International Conference on Pattern Recognition (ICPR'06). Vol. 3. –– 08/2006. –– P. 1000––1003.
12. Gross, R. The CMU Motion of Body (MoBo) Database : tech. rep. / R. Gross, J. Shi ; Carnegie Mellon University. –– Pittsburgh, PA, 06/2001.
13. Gait Recognition under Speed Transition / A. Mansur [et al.]. –– 06/2014.
14. The OU-ISIR Gait Database Comprising the Treadmill Dataset / Y. Makihara [et al.] // IPSJ Trans. on Computer Vision and Applications. –– 2012. –– Apr. –– Vol. 4. –– P. 53––62.
15. The OU-ISIR Gait Database Comprising the Large Population Dataset with Age and Performance Evaluation of Age Estimation / C. Xu [et al.] // IPSJ Trans. on Computer Vision and Applications. –– 2017. –– Vol. 9, no. 24. –– P. 1––14.
16. Lee, L. Gait Analysis for Recognition and Classification / L. Lee, W. E. L. Grimson // Proc. IEEE Int. Conf. on Face and Gesture Recognition. –– 06/2002. –– P. 148––155.
17. The HumanID gait challenge problem: data sets, performance, and analysis / S. Sarkar [et al.] // IEEE Transactions on Pattern Analysis and Machine Intelligence. –– 2005. –– Feb. –– Vol. 27, no. 2. –– P. 162––177.
18. Shutler, J. On a Large Sequence-Based Human Gait Database / J. Shutler, M. Grant, J. Carter // Applications and Science in Soft Computing. –– 2004. –– Jan.
19. Silhouette Analysis-Based Gait Recognition for Human Identification / L. Wang [et al.] // IEEE Transactions on Pattern Analysis and Machine Intelligence. –– 2004. –– Jan. –– Vol. 25. –– P. 1505––1518.
20. The OU-ISIR Large Population Gait Database with real-life carried object and its performance evaluation / M. Z. Uddin [et al.] // IPSJ Transactions on Computer Vision and Applications. –– 2018. –– May. –– Vol. 10, no. 1. –– P. 5.
21. Nixon, M. S. Human identification based on gait / M. S. Nixon, T. Tan, R. Chellappa. Vol. 4. –– Springer Science & Business Media, 2010.
22. The effect of time on gait recognition performance / D. Matovski [et al.] // International Conference on Pattern Recognition (ICPR). Vol. 7. –– 04/2012.
23. The AVA Multi-View Dataset for Gait Recognition / D. López-Fernández [et al.] // Activity Monitoring by Multiple Distributed Sensing. –– Springer International Publishing, 2014. –– P. 26––39. –– (Lecture Notes in Computer Science).
24. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition / N. Takemura [et al.] // IPSJ Trans. on Computer Vision and Applications. –– 2018. –– Vol. 10, no. 4. –– P. 1––14.
25. Han, J. Individual Recognition Using Gait Energy Image / J. Han, B. Bhanu // IEEE Trans. Pattern Anal. Mach. Intell. –– Washington, DC, USA, 2006. –– Vol. 28, no. 2. –– P. 316––322.
26. Bashir, K. Gait recognition using gait entropy image / K. Bashir, T. Xiang, S. Gong // Proceedings of the 3rd International Conference on Crime Detection and Prevention. –– 2009. –– P. 1––6.
27. Chen, J. Average Gait Differential Image Based Human Recognition / J. Chen, J. Liu // Scientific World Journal. –– 2014.
28. Chrono-Gait Image: A Novel Temporal Template for Gait Recognition / C. Wang [et al.] // Computer Vision – ECCV 2010. –– Berlin, Heidelberg : Springer Berlin Heidelberg, 2010. –– P. 257––270.
29. Lam, T. H. Gait flow image: A silhouette-based gait representation for human identification / T. H. Lam, K. Cheung, J. N. Liu // Pattern Recognition. –– 2011. –– Vol. 44, no. 4. –– P. 973––987.
30. Effective Part-Based Gait Identification using Frequency-Domain Gait Entropy Features / M. Rokanujjaman [et al.] // Multimedia Tools and Applications. Vol. 74. –– 11/2013.
31. Gait Recognition Using a View Transformation Model in the Frequency Domain / Y. Makihara [et al.] // Computer Vision – ECCV 2006 / ed. by A. Leonardis, H. Bischof, A. Pinz. –– Berlin, Heidelberg : Springer Berlin Heidelberg, 2006. –– P. 151––163.
32. Dalal, N. Histograms of Oriented Gradients for Human Detection / N. Dalal, B. Triggs // Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. –– Washington, DC, USA : IEEE Computer Society, 2005. –– P. 886––893.
33. Multiple HOG templates for gait recognition / Y. Liu [et al.] // Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012). –– 2012. –– P. 2930––2933.
34. Learning Realistic Human Actions from Movies / I. Laptev [et al.] // CVPR 2008 - IEEE Conference on Computer Vision & Pattern Recognition. –– Anchorage, United States : IEEE Computer Society, 06/2008. –– P. 1––8.
35. Yang, Y. Gait Recognition Using Flow Histogram Energy Image / Y. Yang, D. Tu, G. Li // 2014 22nd International Conference on Pattern Recognition. –– 2014. –– P. 444––449.
36. Cross-view gait recognition using joint Bayesian / C. Li [et al.] // Proc. SPIE 10420, Ninth International Conference on Digital Image Processing (ICDIP 2017). –– 2017.
37. Complete canonical correlation analysis with application to multi-view gait recognition / X. Xing [et al.] // Pattern Recognition. –– 2016. –– Vol. 50. –– P. 107––117.
38. Belhumeur, P. N. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection / P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman // IEEE Trans. Pattern Anal. Mach. Intell. –– 1997. –– July. –– Vol. 19, no. 7. –– P. 711––720.
39. Bashir, K. Gait recognition without subject cooperation / K. Bashir, T. Xiang, S. Gong // Pattern Recognition Letters. –– 2010. –– Vol. 31. –– P. 2052––2060.
40. Cross-view gait recognition using view-dependent discriminative analysis / A. Mansur [et al.] // IEEE International Joint Conference on Biometrics. –– 2014. –– P. 1––8.
// IEEE International Joint Conference on Biometrics. –– 2014. –– P. 1––8.
41. Joint Intensity and Spatial Metric Learning for Robust Gait Recognition / Y. Makihara [et al.] // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). –– 07/2017. –– P. 6786––6796.
42. View-Invariant Discriminative Projection for Multi-View Gait-Based Human Identification / M. Hu [et al.] // Information Forensics and Security, IEEE Transactions on. –– 2013. –– Dec. –– Vol. 8. –– P. 2034––2045.
43. Muramatsu, D. View Transformation Model Incorporating Quality Measures for Cross-View Gait Recognition / D. Muramatsu, Y. Makihara, Y. Yagi // IEEE Transactions on Cybernetics. –– 2015. –– July. –– Vol. 46.
44. Muramatsu, D. Cross-view gait recognition by fusion of multiple transformation consistency measures / D. Muramatsu, Y. Makihara, Y. Yagi // IET Biometrics. –– 2015. –– June. –– Vol. 4.
45. Arseev, S. Human Recognition by Appearance and Gait / S. Arseev, A. Konushin // Programming and Computer Software. –– 2018. –– P. 258––265.
46. Yoo, J.-H. Automated Markerless Analysis of Human Gait Motion for Recognition and Classification / J.-H. Yoo // ETRI Journal. –– 2011. –– Apr. –– Vol. 33. –– P. 259––266.
47. Whytock, T. Dynamic Distance-Based Shape Features for Gait Recognition / T. Whytock, A. Belyaev, N. M. Robertson // Journal of Mathematical Imaging and Vision. –– 2014. –– Nov. –– Vol. 50, no. 3. –– P. 314––326.
48. Fusion of spatial-temporal and kinematic features for gait recognition with deterministic learning / M. Deng [et al.] // Pattern Recognition. –– 2017. –– Vol. 67. –– P. 186––200.
49. Castro, F. Pyramidal Fisher Motion for multiview gait recognition / F. Castro, M. Marín-Jiménez, R. Medina-Carnicer // 22nd International Conference on Pattern Recognition. –– 2014. –– P. 1692––1697.
50. Simonyan, K. Two-stream convolutional networks for action recognition in videos / K. Simonyan, A. Zisserman // NIPS. –– 2014. –– P. 568––576.
51. Beyond short snippets: Deep networks for video classification / J. Ng [et al.] // CVPR. –– 2015.
52. Feichtenhofer, C. Convolutional two-stream network fusion for video action recognition / C. Feichtenhofer, A. Pinz, A. Zisserman // CVPR. –– 2016.
53. Varol, G. Long-term Temporal Convolutions for Action Recognition / G. Varol, I. Laptev, C. Schmid // IEEE Transactions on Pattern Analysis and Machine Intelligence. –– 2017.
54. Wang, L. Action recognition with trajectory-pooled deep-convolutional descriptors / L. Wang, Y. Qiao, X. Tang // CVPR. –– 2015. –– P. 4305––4314.
55. DeepGait: A Learning Deep Convolutional Representation for Gait Recognition / X. Zhang [et al.] // Biometric Recognition. –– Cham : Springer International Publishing, 2017. –– P. 447––456.
56. DeepGait: A Learning Deep Convolutional Representation for View-Invariant Gait Recognition Using Joint Bayesian / C. Li [et al.] // Applied Sciences. –– 2017. –– Feb. –– Vol. 7. –– P. 15.
57. VGR-Net: A View Invariant Gait Recognition Network / D. Thapar [et al.] // IEEE 4th International Conference on Identity, Security, and Behavior Analysis (ISBA). –– 10/2017.
58. Wolf, T. Multi-view gait recognition using 3D convolutional neural networks / T. Wolf, M. Babaee, G. Rigoll // 2016 IEEE International Conference on Image Processing (ICIP). –– 09/2016. –– P. 4165––4169.
59. GEINet: View-invariant gait recognition using a convolutional neural network / K. Shiraga [et al.] // 2016 International Conference on Biometrics (ICB). –– 2016. –– P. 1––8.
60. A Comprehensive Study on Cross-View Gait Based Human Identification with Deep CNNs / Z. Wu [et al.] // IEEE Transactions on Pattern Analysis and Machine Intelligence. –– 2016. –– Mar. –– Vol. 39.
61. On Input/Output Architectures for Convolutional Neural Network-Based Cross-View Gait Recognition / N. Takemura [et al.] // IEEE Transactions on Circuits and Systems for Video Technology. –– 2019. –– Sept. –– Vol. 29, no. 9. –– P. 2708––2719.
62. Siamese neural network based gait recognition for human identification / C. Zhang [et al.] // IEEE International Conference on Acoustics, Speech and Signal Processing. –– 2016. –– P. 2832––2836.
63. GaitSet: Regarding Gait as a Set for Cross-View Gait Recognition / H. Chao [et al.] // AAAI. –– 2019.
64. Invariant feature extraction for gait recognition using only one uniform model / S. Yu [et al.] // Neurocomputing. –– 2017. –– Vol. 239. –– P. 81––93.
65. GaitGAN: Invariant Gait Feature Extraction Using Generative Adversarial Networks / S. Yu [et al.] // 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). –– 2017. –– P. 532––539.
66. Learning view invariant gait features with Two-Stream GAN / Y. Wang [et al.] // Neurocomputing. –– 2019. –– Vol. 339. –– P. 245––254.
67. Hochreiter, S. Long Short-Term Memory / S. Hochreiter, J. Schmidhuber // Neural Comput. –– 1997. –– Nov. –– Vol. 9, no. 8. –– P. 1735––1780.
68. Gait Identification by Joint Spatial-Temporal Feature / S. Tong [et al.] // Biometric Recognition. –– 2017. –– P. 457––465.
69. Feng, Y. Learning effective gait features using LSTM / Y. Feng, Y. Li, J. Luo // International Conference on Pattern Recognition. –– 2016. –– P. 325––330.
70. Memory-based Gait Recognition / D. Liu [et al.] // Proceedings of the British Machine Vision Conference (BMVC). –– BMVA Press, 09/2016. –– P. 82.1––82.12.
71. Pose-Based Temporal-Spatial Network (PTSN) for Gait Recognition with Carrying and Clothing Variations / R. Liao [et al.] // Biometric Recognition. –– Cham : Springer International Publishing, 2017. –– P. 474––483.
72. Battistone, F. TGLSTM: A time based graph deep learning approach to gait recognition / F. Battistone, A. Petrosino // Pattern Recognition Letters. –– 2019. –– Vol. 126. –– P. 132––138. –– Robustness, Security and Regulation Aspects in Current Biometric Systems.
73. Event-Based Moving Object Detection and Tracking / A. Mitrokhin [et al.] // 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). –– 10/2018. –– P. 1––9.
74. Spatiotemporal multiple persons tracking using Dynamic Vision Sensor / E. Piątkowska [et al.] // 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. –– 06/2012. –– P. 35––40.
75. Yuan, W. Fast localization and tracking using event sensors / W. Yuan, S. Ramalingam // IEEE International Conference on Robotics and Automation (ICRA). –– 2016. –– P. 4564––4571.
76. Neuromorphic Vision Sensing for CNN-based Action Recognition / A. Chadha [et al.] // ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). –– 05/2019. –– P. 7968––7972.
77. Dynamic Vision Sensors for Human Activity Recognition / S. A. Baby [et al.] // 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR). –– 11/2017. –– P. 316––321.
78. Bay, H. SURF: Speeded Up Robust Features / H. Bay, T. Tuytelaars, L. Van Gool // Computer Vision – ECCV 2006. –– 2006. –– P. 404––417.
79. Event-driven body motion analysis for real-time gesture recognition / B. Kohn [et al.]
// 2012 IEEE International Symposium on Circuits and Systems (ISCAS). –– 05/2012. –– P. 703––706.
80. A Low Power, Fully Event-Based Gesture Recognition System / A. Amir [et al.] // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). –– 07/2017. –– P. 7388––7397.
81. Asynchronous frameless event-based optical flow / R. Benosman [et al.] // Neural Networks. –– 2012. –– Vol. 27. –– P. 32––37.
82. EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras / A. Zhu [et al.] // Proceedings of Robotics: Science and Systems. –– Pittsburgh, Pennsylvania, 06/2018.
83. Wang, X. Intelligent multi-camera video surveillance: A review / X. Wang // Pattern Recognition Letters. –– 2013. –– Vol. 34, no. 1. –– P. 3––19. –– Extracting Semantics from Multi-Spectrum Video.
84. Large-scale video classification with convolutional neural networks / A. Karpathy [et al.] // CVPR. –– 2014.
85. Burton, A. Thinking in Perspective: Critical Essays in the Study of Thought Processes / A. Burton, J. Radford. –– Methuen, 1978. –– 232 p.
86. Warren, D. Electronic Spatial Sensing for the Blind: Contributions from Perception, Rehabilitation, and Computer Vision / D. Warren, E. R. Strelow. –– Netherlands, 1985. –– 521 p.
87. Farneback, G. Two-frame motion estimation based on polynomial expansion / G. Farneback // Proc. of Scandinavian Conf. on Image Analysis. Vol. 2749. –– 2003. –– P. 363––370.
88. OpenCV (Open Source Computer Vision Library). –– URL: https://opencv.org.
89. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields / Z. Cao [et al.] // CVPR. –– 2017.
90. Automatic Learning of Gait Signatures for People Identification / F. M. Castro [et al.] // Advances in Computational Intelligence. –– Cham : Springer International Publishing, 2017. –– P. 257––270.
91. Return of the devil in the details: Delving deep into convolutional nets / K. Chatfield [et al.] // Proc. BMVC. –– 2014.
92. Ioffe, S. Batch normalization: Accelerating deep network training by reducing internal covariate shift / S. Ioffe, C. Szegedy // ICML. –– 2015.
93. Krizhevsky, A. ImageNet Classification with Deep Convolutional Neural Networks / A. Krizhevsky, I. Sutskever, G. Hinton // Neural Information Processing Systems. –– 2012. –– Jan. –– Vol. 25.
94. Dropout: A Simple Way to Prevent Neural Networks from Overfitting / N. Srivastava [et al.] // Journal of Machine Learning Research. –– 2014. –– Vol. 15. –– P. 1929––1958.
95. Chen, T. Net2Net: Accelerating learning via knowledge transfer / T. Chen, I. Goodfellow, J. Shlens // ICLR. –– 2016.
96. Glorot, X. Understanding the difficulty of training deep feedforward neural networks / X. Glorot, Y. Bengio // Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Vol. 9 / ed. by Y. W. Teh, M. Titterington. –– Chia Laguna Resort, Sardinia, Italy : PMLR, 05/2010. –– P. 249––256. –– (Proceedings of Machine Learning Research).
97. Simonyan, K. Very deep convolutional networks for large-scale image recognition / K. Simonyan, A. Zisserman // ICLR. –– 2015.
98. Triplet probabilistic embedding for face verification and clustering / S. Sankaranarayanan [et al.] // 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS). –– 2016. –– P. 1––8.
99. Small medical encyclopedia. Vol. 4 / ed. by V. Pokrovskiy. –– 1996. –– P. 570.
100. Convolutional pose machines / S.-E. Wei [et al.] // CVPR. –– 2016.
101. Deep Residual Learning for Image Recognition / K. He [et al.] // 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). –– 2016. –– P. 770––778.
102. Zagoruyko, S. Wide Residual Networks / S. Zagoruyko, N. Komodakis // Proceedings of the British Machine Vision Conference (BMVC) / ed. by R. C. Wilson, E. R. Hancock, W. A. P. Smith. –– BMVA Press, 09/2016.
103. Lasagne: First release / S. Dieleman [et al.]. –– 08/2015. –– URL: http://dx.doi.org/10.5281/zenodo.27878.
104. Castro, F. M. Multimodal Features Fusion for Gait, Gender and Shoes Recognition / F. M. Castro, M. J. Marín-Jiménez, N. Guil // Mach. Vision Appl. –– 2016. –– Nov. –– Vol. 27, no. 8. –– P. 1213––1228.
105. Deep multi-task learning for gait-based biometrics / M. Marín-Jiménez [et al.] // IEEE International Conference on Image Processing (ICIP). –– 09/2017.
106. Guan, Y. A robust speed-invariant gait recognition system for walker and runner identification / Y. Guan, C.-T. Li. –– 2013. –– Jan.
107. Cross-view gait recognition using view-dependent discriminative analysis / A. Mansur [et al.] // IJCB 2014 – 2014 IEEE/IAPR International Joint Conference on Biometrics. –– 12/2014.
108. PyTorch: An Imperative Style, High-Performance Deep Learning Library / A. Paszke [et al.] // Advances in Neural Information Processing Systems 32. –– Curran Associates, Inc., 2019. –– P. 8024––8035.
109. Lichtsteiner, P. A 128×128 120 dB 15µs Latency Asynchronous Temporal Contrast Vision Sensor / P. Lichtsteiner, C. Posch, T. Delbruck // IEEE Journal of Solid-State Circuits. –– 2008. –– Vol. 43, no. 2. –– P. 566––576.
110. Newell, A. Stacked Hourglass Networks for Human Pose Estimation / A. Newell, K. Yang, J. Deng // ECCV. –– 2016.
111. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis / M. Andriluka [et al.] // CVPR. –– 2014. –– P. 3686––3693.
112. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM / E. Mueggler [et al.] // The International Journal of Robotics Research. –– 2017. –– Vol. 36, no. 2. –– P. 142––149.

List of Figures

0.1 Informal problem statement: having a video with a walking person one needs to determine this person's identity from the database . . . 6
1.1 Examples of factors affecting the gait itself: using a supporting cane (a), wearing high heels (b) and carrying conditions: bag (c), backpack (d) or other loads (e) . . . 13
1.2 Examples of factors affecting the gait representation: wearing an overall (a), long dress (b) or coat (c), occlusions (d), and bad lighting and low quality of the frame (e) . . . 13
1.3 Examples of frames from TUM-GAID database . . . 14
1.4 Examples of frames from CASIA Gait Dataset B . . . 15
1.5 Examples of frames from OU-ISIR Large Population Dataset . . . 15
1.6 Examples of basic gait descriptors . . . 20
2.1 The same frame with different brightness, contrast and saturation settings . . . 27
2.2 Optical flow as a vector field of motion between consecutive frames . . . 27
2.3 General algorithm pipeline . . . 28
2.4 The visualizations of OF components and the constructed image . . . 29
2.5 Input block.
The square box cropped from 10 consecutive frames of horizontal (the first and the second rows) and vertical (the third and the fourth rows) components of normalized OF maps . . . 31
2.6 The scheme of CNN-M fusion architecture . . . 36
2.7 The pipeline of the algorithm based on blocks of optical flow maps . . . 38
2.8 Side-view frames from TUM and CASIA gait databases . . . 43
3.1 Gait independent motions of hands and head that are not informative . . . 46
3.2 The visualization of optical flow between two silhouette masks (two channels are the components of OF and the third one is set to zero) . . . 47
3.3 The pipeline of the pose-based gait recognition algorithm . . . 48
3.4 The bounding boxes for human body parts: right and left feet (small blue boxes at the bottom), upper and lower body (green boxes in the middle) and full body (big red box) . . . 49
3.5 The visualizations of optical flow maps in five areas of the human figure . . . 50
3.6 The frame with human figure bounding box (red) and the border of probable crops (green) depicted and the examples of patches that can be cropped from this one frame . . . 51
3.7 Construction of residual blocks in Wide ResNet architecture . . . 53
3.8 ROC (the first row) and CMC (the second row) curves under 85° gallery view and 55°, 65°, and 75° probe view, respectively . . . 61
4.1 The example of the same walk captured under different viewing angles . . . 66
4.2 The desired cluster structure of the data . . . 67
4.3 The pipeline of gait recognition algorithm with two modifications (view loss and CV TPE) shown in green boxes . . . 68
4.4 The curve of the training loss during three steps of training process . . . 70
4.5 The desired mutual arrangement of the elements in a triplet . . . 71
5.1 The transformation of the event stream into a sequence of frames . . . 78
5.2 The examples of event-based image (left) and blob detection result (right) . . . 79
5.3 The pipeline of the event-based gait recognition algorithm . . . 79
5.4 Examples of the skeletons estimated by fine-tuned Stacked Hourglass model . . . 82
5.5 Human figure bounding boxes after blob detection (left) and after refinement by pose estimation (right) . . . 83
5.6 Two components of event-based optical flow maps . . . 84
5.7 The visualizations of the real and simulated event streams . . . 86
5.8 The ROC curves corresponding to three considered models and the original RGB model: fully depicted (left) and large-scale (right) to show the difference . . . 89

List of Tables

1.1 Gait recognition datasets . . . 18
2.1 The baseline architecture . . . 34
2.2 The VGG-like architecture . . . 36
2.3 Results on TUM-GAID dataset . . . 41
2.4 Quality of recognition on TUM-GAID dataset under different conditions . . . 42
2.5 The quality of transfer learning . . . 43
2.6 The quality of the jointly learned model . . . 44
3.1 Wide Residual Network architecture . . . 52
3.2 Recognition quality on TUM-GAID dataset . . . 56
3.3 Recognition quality on TUM-GAID dataset under different conditions . . . 57
3.4 Recognition accuracy on CASIA lateral part . . . 57
3.5 Recognition quality on CASIA lateral part under different conditions . . . 58
3.6 Average recognition rates for three angles on CASIA database . . . 59
3.7 Comparison of Rank-1 and EER on OULP dataset . . . 60
3.8 Comparison of Rank-1 on OULP dataset following five-fold CV protocol . . . 61
3.9 Recognition quality of models based on different sets of body parts on TUM-GAID dataset . . . 62
3.10 Recognition quality for different video lengths on TUM-GAID dataset . . . 63
3.11 Recognition quality for different video lengths on CASIA-B dataset . . . 64
3.12 The quality of transfer learning . . . 64
4.1 Comparison of average cross-view recognition accuracy of proposed approaches and state-of-the-art models . . . 75
5.1 The quality of the end-to-end model on simulated TUM dataset . . . 87
5.2 The quality of the perfect-detection model on simulated TUM dataset . . . 88
5.3 Results of verification on TUM-GAID dataset . . . 89