Audio-Visual localization of
Humans for Robotic Patrolling of
Indoor Environments
- Aalborg University -
Project Report
ROB3_gr01
Aalborg University
Electronics and IT
Copyright © Aalborg University 2015
Electronics and IT
Aalborg University
http://www.aau.dk
Title:
Audio Visual Localisation
Theme:
Automatic Sensing of the Environment
Project Period:
Fall Semester 2022
Project Group:
ROB3_gr01
Participant(s):
Jonathan Rod Skarregaard
Silas Porsgaard Steensgaard
Christoffer Thomas Ulf Koch Andersen
Hans Henrik Dalgaard
Peter Plass Jensen
Supervisor(s):
Jesper Rindom Jensen
Copies: 1
Page Numbers: 80
Abstract:
According to Danmarks Statistik, 3,849 burglaries were committed at company and business properties in the first six months of 2022. This report explores the possibility of developing a mobile robot, equipped with video and audio sensors, to patrol large business properties in order to decrease the workload of security forces while maintaining a high degree of safety. The robot uses audio source localization to detect sound anomalies while patrolling, and motion detection algorithms on the camera feed serve as an early warning system for possible intrusions. The audio source localization uses a three-microphone array and cross-correlation to determine interaural time differences, which allow the sound source direction to be estimated. The motion detection algorithm uses a live video feed and performs background subtraction on the images to detect and draw bounding boxes around objects in motion.
Date of Completion:
April 30, 2023
The content of this report is freely available, but publication (with reference) may only be pursued in agreement with the author.
Contents

Preface

1 Introduction
2 Problem Analysis
  2.1 Target Demographic
  2.2 Break-ins
  2.3 Environmental Challenges
    2.3.1 Lighting Conditions
    2.3.2 Sound Conditions
    2.3.3 On Sound Diffraction
    2.3.4 Using Multiple Sensors
  2.4 Alternative Solutions to assist Security Guards
    2.4.1 A Camera-microphone solution
    2.4.2 A Mobile Robot Solution
  2.5 State of the art
    2.5.1 Commercial products
    2.5.2 Research
    2.5.3 Path Planning using Reinforcement Learning and Neural Networks
    2.5.4 State of the Art Conclusion
  2.6 Sensor Possibilities
    2.6.1 Audio Sensors
    2.6.2 Visual Sensors
  2.7 Subconclusion & Problem Formulation
3 Requirements
4 Design Concept
  4.1 Camera Setups
  4.2 Sound localization
  4.3 On the use of multiple sensors
  4.4 Concept Selection
    4.4.1 Microphone Selection
    4.4.2 Functional Requirements
5 Methodology
  5.1 Anomaly Detection
  5.2 Audio Analysis
    5.2.1 Fourier Transform in Audio Analysis
    5.2.2 Using The Fourier Transform To Detect Sound Anomalies
    5.2.3 Nyquist-Shannon Sampling Theorem
    5.2.4 Aliasing
    5.2.5 Spatial Aliasing
    5.2.6 Sampling Rate
  5.3 Motion Detection
  5.4 Auditory localization
    5.4.1 Determining TDOA
  5.5 Farfield DOA proof
6 Implementation
  6.1 Hardware
    6.1.1 Onboard Hardware
    6.1.2 Additional Hardware
  6.2 Motion detection program
  6.3 Anomaly Detection, TDOA and sound source locating program
  6.4 ROS
    6.4.1 Topics
    6.4.2 Nodes
    6.4.3 Project ROS implementation
7 Verification and Use Case Validation
  7.1 Resource limitations
  7.2 Motion detection program Testing
    7.2.1 The test cases
    7.2.2 Results of motion detection tests
    7.2.3 Visual Test Conclusion
  7.3 Audio Testing
    7.3.1 Assumptions
    7.3.2 Audio Testing Results
8 Discussion
  8.1 Sources of error
    8.1.1 Sound-Associated Errors
    8.1.2 Light and Color Associated Errors
  8.2 Areas of Improvement
    8.2.1 Areas of Improvement in the Sound Analysis
    8.2.2 Areas of Improvement in Motion Detection
    8.2.3 Miscellaneous Areas of Improvement
  8.3 Requirement Fulfillment
9 Conclusion
Bibliography
A RQT_graph
Preface
Aalborg University, April 30, 2023
This project has been written by gr_01 of the 3rd semester of the Robotics bachelor
at Aalborg University. The project was written over a 4-month period, from the
beginning of September until the end of December. The report discusses the use
of audio-visual localization in security robotics for patrolling large business properties. We would like to extend our gratitude to our supervisor for the guidance
he has given us.
The source code associated with this project is publicly available in the master branch of a Git repository at the following link:
https://github.com/hh4000/p3_project
Hans Henrik Dalgaard
Peter Plass Jensen
<hdalga21@student.aau.dk>
<ppje21@student.aau.dk>
Silas Porsgaard Stensgaard
Christoffer Thomas Ulf K. Andersen
<spst21@student.aau.dk>
<ctuk21@student.aau.dk>
Jonathan Rod Skarregaard
<jskarr21@student.aau.dk>
Chapter 1
Introduction
It is in the best interest of the majority of businesses and companies to keep their
properties and assets safe from burglaries and vandalism. Because of this, many
companies have taken measures to ensure this safety by installing security cameras,
motion detectors, and alarms on their properties and in their buildings. These measures give law enforcement the ability to react swiftly to security breaches with accurate descriptions of the intruders. Unfortunately, law enforcement entities do not
always have the necessary resources to respond to burglaries in time, and because
of this, many large businesses have chosen to employ private security companies
or guards to patrol and keep their businesses safe. Private security companies and guards add considerable expense for their customers, because security guards often work high-wage night shifts, and one or more guards are often required to patrol the premises of a business.
According to Danmarks Statistik, there have been 3849 burglaries of companies and
businesses in the first six months of 2022 alone, which corresponds to 21 burglaries
per day [41]. These burglaries can impose significant financial losses on businesses and companies through loss of productivity and loss of assets. These losses may seem negligible to large companies, but it is still in the best interest of a company to prevent as many as possible. Because of this, this report explores the possibility of using robotics to help reduce the workload and expenditure of private security measures for companies while maintaining a high degree of security.
Chapter 2
Problem Analysis
Property security is an ever-present problem for private homes, commercial enterprises, and industrial complexes. As technology has progressed, many more ways of implementing security measures to prevent and detect intrusion have emerged. These include, but are not limited to, security cameras, guards, and motion sensors. Another way to increase security could be to implement a patrolling security robot capable of detecting intruders.
This problem analysis aims to cover all relevant aspects of the above-mentioned
patrolling security robot. Firstly, the target demographic is analyzed in Section 2.1.
After this, the environmental challenges of the expected environment and potential
ways to circumvent them are tackled in Section 2.3. This is followed by a market
analysis exploring the state of the art, regarding security robotics, in Section 2.5.
Lastly, a final problem formulation is formed in Section 2.7.
2.1 Target Demographic
The required complexity of the robot means that its price is likely to make it a less attractive option for smaller companies, as these companies are less likely to have the expensive inventory and equipment needed to make the robot financially viable. The robot may also be viable for companies that may not
have high-value physical objects, but instead have confidential information that
would be detrimental in the hands of the wrong people, such as data centers. The
robot in itself may not be enough to prevent theft; it has no functions for apprehending or stopping intruders beyond alerting on-site security or the police, and, since it can only discover intruders, it mainly functions as a deterrent.
This means that it functions optimally when implemented in combination with
on-site security. Additionally, for the robot to be worthwhile, stationary security cameras would need to be a less attractive option for the surveillance area, whether due to the cost of implementing full-coverage camera surveillance or because security is only needed short term. If this is not the case, this solution would likely not make sense for the company, as even with an optimal implementation, 24/7 complete camera surveillance would be more effective. This leaves the following businesses as some of the potential consumers
of the product:
• Gated communities
• University campuses
• Critical infrastructure (such as electrical substations and water sources)
• Apartment complexes and offices
• Factories and warehouses
There may be some companies that decide to implement the robot to supplement
current security measures with a physical deterrent. This type of implementation
has been attempted in places such as Liberty Village in Las Vegas, USA. While there
is no definitive proof that this solution decreased crime rates in the apartment complex, the area was removed from the Las Vegas Metropolitan Police Department’s
list of the top 10 areas with the most frequent 911 calls after the launch of a Knightscope
patrol robot in the area. Therefore, the physical presence of a robot may be an
effective deterrent [11].
2.2 Break-ins
Burglaries in Denmark have been declining for many years, but, as mentioned in chapter 1, there were thousands of company burglaries in Denmark
in the first 6 months of 2022 alone [42]. So, while burglaries are on a downward
trend, companies still have an interest in keeping security high as the chance of a
burglary is never zero. A graph of the burglaries in Denmark can be seen in Figure
2.1.
Figure 2.1: A graph from Danmarks Statistik showing the number of burglaries in residential buildings in blue, businesses in orange, and uninhabited residential buildings in green [21].

Construction sites and construction workers' vans often fall victim to thefts at night because of the valuable tools they contain. When a theft happens, the workers are usually out of commission for a day or two while they report the incident to the police or are transferred to a different project. Thefts at construction sites and from vans amount, on average, to 2,286 €, not including wasted company time [20].
Most companies, especially larger companies, currently have security cameras and
alarms to help prevent incidents, but these systems are stationary and can be tampered with, and implementing a full security force is expensive [14]. Additionally,
some companies (e.g. factories) have vast properties that are nearly impossible
to patrol sufficiently with security forces within a realistic budget. This makes a
robotic patrol unit more attractive. Many large companies with high-value products or information do however still employ security firms and security guards for
either increased security, deterrence, or faster reaction time to intrusions.
Security guards have many responsibilities, including patrolling, standing guard, and apprehending intruders, some of which can be done by an autonomous mobile robot. The average hourly wage of a security guard is around 22 €/h, depending on experience. An extra 3.9 €/h is paid for shifts extending into the hours between 17:00 and 06:00. If a company needs a security guard on its premises between 22:00 and 06:00 every day, this totals 56 hours per week. Assuming the guard earns an average of 25.9 € per hour, this amounts to approximately 75,400 € annually (not including bonuses for working during weekends or holidays)
[39]. The implementation of a robotic patrolling unit cannot fully replace an entire
security team, but it can replace parts of the job such as patrolling and standing
guard. Therefore the company can cover larger areas with only a few guards for
the verification and apprehension of the intruder.
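For reference, the annual figure above follows directly from the stated numbers: 56 h/week × 52 weeks/year × 25.9 €/h ≈ 75,400 €/year.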
2.3 Environmental Challenges
Since systems using audio or visual localization are entirely reliant on their sensor
inputs, it is crucial to ensure appropriate environmental conditions for optimal
operation. A perceptive system that is flooded with noise is essentially blind, or at least impaired, since its sensor inputs no longer carry any distinguishable information. This section will cover the consequences
of poor conditions and some ways to circumvent these audio-visual challenges.
2.3.1 Lighting Conditions
Systems that utilize computer vision, image processing, and the like are dependent
on appropriate lighting conditions in order for the onboard cameras to capture
useful images of the environment. This is especially relevant for systems that will
operate indoors, at night, or both. If the lighting conditions are poor, the system
could compensate by using pre-processing on the image to increase the brightness
and/or contrast. However, this also amplifies the visual noise, making it more difficult to gather correct information from the image, although the processed image might still be of value.
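As an illustration of such pre-processing, the sketch below brightens a dark frame with a simple gain and offset. It assumes OpenCV is used for image handling; the gain, offset, and file names are illustrative values and not part of the actual system.

import cv2

def brighten(frame, gain=1.8, offset=40):
    # Apply new_pixel = gain * pixel + offset, clipped to the valid 0-255 range.
    return cv2.convertScaleAbs(frame, alpha=gain, beta=offset)

frame = cv2.imread("dark_hallway.jpg")   # hypothetical input image
if frame is not None:
    cv2.imwrite("dark_hallway_brightened.jpg", brighten(frame))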
Figure 2.2 lists the estimated brightness of different environments.
The brightness of the environment is measured in lux, also known as lumens per
square meter. Some of the important values here are those of the "Hallway" and the
"Office lighting", as these are the environments the robot will mostly be surveying.
The hallway, with an estimated brightness of 80 lux, will be our benchmark for a well-lit environment; all environments with a lux level of over 80 will be referred to as "well-lit". Environments with a lux level below 80 will be referred to as "poorly lit" from this point on. "Office lighting" and "Full daylight" will mostly apply when
performing surveillance during the day, as most offices’ artificial lighting is limited
throughout the nighttime. These values lie well within the requirement of a well-lit
environment. It is important to note that "Office lighting" and "Hallway" are not
the only light levels covered by the target demographic, but those are the most
prevalent; other non-mentioned lighting conditions are "Overcast day", "Very dark
overcast day", and "Minimal street lighting", among others.
Figure 2.2: A table showing the estimated brightness of environments in lux[13].
The issue of capturing useful images can be tackled by either ensuring proper
lighting conditions in the operational environment, or by equipping the system
with an array of different cameras that are useful in different scenarios and under different lighting conditions. If a patrolling security robot is examined as an
example, it is safe to assume that the robot will be looking for either intruders or
signs of intrusion. This means that an infrared camera could be used to record
heat signatures when lighting conditions do not allow for a clear identification of
the intruder. Another method of assisting navigation and identifying intruders in
poor lighting conditions is using depth sensors to detect movement. This sensor
is also useful when characterizing the topography of the environment, as well as
detecting obstacles. However, these additional sensors naturally drive up the total
production cost of each unit. Some of these sensors will be further elaborated upon in section 2.6.2.
2.3.2 Sound Conditions
Much like systems that utilize vision, systems that depend on audio input need
appropriate conditions for optimal operation. An audio-perception system in a
noisy facility would not work as optimally as in a quiet office building since the
incoming audio anomalies are much easier to distinguish in a quieter environment.
It is safe to assume that the system will be looking for anomalies such as windows breaking, doors opening and closing, and unexpected sounds of movement.
If the environment is especially noisy, the sounds that the system would like to
detect are easily drowned out by background noise.
When using sound sensors, there are generally two ways to utilize them. One way
is to use sonar mapping, i.e., actively emitting sound at a certain pitch and using the return times to map out the environment. Another way is to passively analyze the sound from the environment to calculate the direction of the sounds of interest.
2.3.3 On Sound Diffraction
Before examining audio sensors, this section will give an overview of the properties of sound. This will give a broader understanding of why one might use sound
sensors.
It is a well-known fact that sound is capable of bending around corners. This is
due to the wave property of diffraction, which occurs for all types of waves [18].
As an example, single-slit diffraction will be examined below.
When speaking of single-slit diffraction, it is often light that is taken into consideration. In this case, equation 2.1 below is often used to determine the position of the dark fringes [18].
\[ \sin(\theta) = \frac{m\lambda}{a}, \qquad (m = \pm 1, \pm 2, \pm 3, \ldots) \tag{2.1} \]
In equation 2.1, θ is the angle from the slit to the center of the m’th dark fringe on
the screen, λ is the wavelength, and a is the slit width.
In addition to this, the intensity at different angles can be calculated as seen below
(equation 2.2) [18].
\[ I = I_0 \left( \frac{\sin\bigl(\pi a \sin(\theta)/\lambda\bigr)}{\pi a \sin(\theta)/\lambda} \right)^{2} \tag{2.2} \]
In equation 2.2, I is the intensity at a given angle θ and I_0 is the intensity at θ = 0.
The application of equation 2.2 at different ratios of λ/a is explored in figure 2.3.
In figure 2.3 it can be seen that all functions have a central peak around θ =
0. This peak will henceforth be referred to as the central intensity maximum.
Additionally, it can be seen that, as λ/a increases, the central intensity maximum becomes wider.
One thing that is apparent from equation 2.1 is that it is only usable when the ratio λ/a is less than or equal to 1. If λ/a > 1, there is no solution to the equation. This does, however, raise the question of what happens when the wavelength λ is larger than the slit width a. From equation 2.2 it can be seen that the central intensity maximum would extend beyond 180° [18]. This is relevant since most
sounds have a large wavelength. The sound waves of human speech generally
have wavelengths of 1 m or greater[18]. Assuming that a doorway is 1 m or less
wide, this would be a case of the wavelength being longer than the width of the slit
(the width of the doorway). In this case, the sound could easily travel through the
doorway and spread into whatever room the doorway connects to, even without
taking sound reflection into account.
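As a small numerical check of equation 2.2, the sketch below evaluates the normalised intensity I/I0 at a few angles for the same a/λ ratios that are plotted in Figure 2.3. It is a minimal illustration using numpy; the chosen angles are arbitrary.

import numpy as np

def relative_intensity(theta_deg, a_over_lambda):
    # Equation 2.2: I/I0 = [sin(pi*a*sin(theta)/lambda) / (pi*a*sin(theta)/lambda)]^2.
    # np.sinc(u) computes sin(pi*u)/(pi*u), so we use u = (a/lambda) * sin(theta).
    u = a_over_lambda * np.sin(np.radians(theta_deg))
    return np.sinc(u) ** 2

angles = np.array([0.0, 10.0, 20.0, 40.0])
for ratio in (1, 3, 10):
    print(f"a = {ratio}*lambda:", np.round(relative_intensity(angles, ratio), 3))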
Figure 2.3: Graph of the intensity spread I/I0 as a function of θ (in degrees) for three different a/λ ratios (a = λ, a = 3λ, and a = 10λ).

2.3.4 Using Multiple Sensors
One way to circumvent the issue of poor operational conditions can be to employ a
variety of sensor types. In this way, when one sensor has poor conditions, another
may have better conditions. This could, for instance, be in a dark quiet environment. Here, the visual sensors would be impaired, but the sound sensors would
not. Additionally, under optimal conditions, the sensors could supplement each
other, making the robot better at sensing in general. In the case of utilizing cameras and sound sensors, this could mean that the sound sensors could detect sound
anomalies outside of the camera’s view, and the camera could detect anomalies not
making audible noise.
2.4 Alternative Solutions to assist Security Guards
This section will discuss solutions for security that can help assist security guards
or limit the amount of work required on-site.
2.4.1 A Camera-microphone solution
Using a set of cameras to monitor a perimeter is generally the most common
method. These camera feeds are usually either actively monitored, or stored for
reference, in the event of an intrusion. The cameras are occasionally equipped with
microphones to provide additional insight when detecting intruders. However, a
problem arises when the sound has to be reviewed; either a guard has to listen to
sounds from many different cameras one at a time with a high risk of missing the
anomalies or the sound has to be stored with the footage for later, making it, in
most cases, useless for catching the intruder in the act. It is possible to create an algorithm that searches all the camera microphones for anomalies at once, but anything short of a perfect algorithm will introduce false positives. Such an algorithm would allow the guard to listen only to the relevant camera microphones, but the cameras cannot investigate the sounds any further, as a robot solution could, so the guard will likely have to do that themselves. This takes the guard's attention away from the cameras, which is time the intruders can use to get past them.
2.4.2 A Mobile Robot Solution
A mobile robot can be equipped with a camera and a microphone array giving it
the same capabilities as a security camera with a microphone. The difference is
the ability of the robot to move around in its environment. This allows the robot
to investigate possible sound anomalies and even locate an intruder. The robot is
still unable to intercept an intruder meaning there is always a need to have at least
one security guard, but they will not have to waste their time investigating false
anomalies. To accomplish these tasks the robot will need a navigation program to
find its way around. One way to do this is to teach the path by manually guiding the robot along it, but there are many other options. It is important to limit the path
of the robot so that it only moves where it needs to. The robot should also move
in unpredictable ways to confuse or surprise the intruder and hopefully deter any
attempt to get past the security. Applying motion detection to the robot's camera feed is also a possibility, even while the robot is moving, meaning none of the abilities of a static camera are lost.
2.5 State of the art
When developing new solutions it is wise to look at what the market has to offer in
terms of existing products, to determine the state of the art. As such, this section
will look at state-of-the-art commercial products, concepts, and research to explore
the current market for existing patrolling robot solutions.
                  Argus S5 [23]             Nimbo [3]           Knightscope K5 [43]
Max Speed         4-6 km/h                  16 km/h             ca. 5 km/h
Dimensions        1750 x 780 x 1420 mm      660 mm x 580 mm     1587.5 x 850.9 x 914.4 mm
Weight            185 kg                    23 kg               ca. 180 kg
Usage             Outdoor (nighttime)       Indoor              Outdoor and Indoor
Route Generation  Pre-programmed            Pre-programmed      Pre-programmed
Navigation        Visual                    Visual              Visual
Sensors           Thermal, Panorama Camera  Camera              LiDAR, Sonar, GPS, Wheel

Table 2.1: Specifications table of state-of-the-art products
2.5.1 Commercial products
The current market for commercial patrol robots is relatively small. The biggest
firm on the market currently is SMP Robotics [15]. They offer a variety of patrolling robots, all with different functions and purposes. This section will compare some of the commercially available products on the market, to get a greater
understanding of what the market offers, as well as get an idea of key features and
performance metrics.
As seen in Table 2.1 the current products on the market are quite similar in almost
every category. However, the way they sense their surroundings is one of the major differences. The Argus S5 uses both thermal and panorama cameras to view its
surroundings, which makes it more efficient in its outdoor dark environment than
a regular camera would be. In contrast, the Nimbo, which is designed to patrol
indoor environments, only uses regular cameras made to detect visible light, as the
expected setting will likely be relatively well illuminated. One of the notable differences is the high maximum speed of the Nimbo robot, which is 3-4 times that of the others. This is likely due to the decreased weight of the robot. Keeping the robot lightweight means
that the motor has to be less powerful, making the robot cheaper. This decreased
price would also make the robotic solution more attractive for smaller companies.
• Knightscope
The Knightscope is an outdoor patrolling robot that is able to recharge itself autonomously. It has a max speed of 5 km/h and 360° vision. The robot is usually deployed in outdoor environments such as parking lots and malls. It is capable of detecting faces and raises an alert when detecting
known criminals. Companies that use the Knightscope have reported lower
crime rates in the areas where the robot is deployed. This is attributed to the
physical presence of the robot as a deterrent.
• Argus
The Argus is a patrol robot that is capable of visually detecting intruders.
Upon detection, the intruder is warned and the staff of the facility is informed
of the intruder’s presence and location. The robot has facial detection, which
allows it to differentiate a worker from an intruder. The robot can also operate under low light conditions, which means that the designated area can
also be surveilled during the evening and night hours. The Argus functions
optimally as part of a group of patrol robots, as this allows fewer blind spots.
• Nimbo
Nimbo is a multipurpose patrolling robot. In addition to the standard features that most patrolling robots have, the Nimbo is also capable of being
used as a hoverboard. Nimbo is usually deployed in indoor environments
such as warehouses, shopping centers, and educational facilities. It mostly
acts as a moving camera, though it is capable of sounding an alarm when
intruders are detected.
2.5.2 Research
A lot of the newest research on patrolling robots is focused on patrolling logic,
interaction with intruders, and randomization of the patrolling path. This section
will take a look at some chosen research papers and give a short summary of them
in order to gain insight into the current problems of patrolling robots, as
well as to generate ideas for future solutions.
• A Survey of Multi-robot Regular and Adversarial Patrolling
In this paper, the researchers took the problem of navigating a dynamic and
uncertain environment and tried to implement an algorithm that could be
used in real-time[17].
The algorithm used in the paper can be simplified into three steps:
1. The smallest possible rectangle that covers the boundary of the room is
set.
2. The rectangle is covered with the minimum number of circles, each sized to
encompass the sensor area of the robot.
3. A patrolling path is searched along the boundary of this set of circles in a
spiral; an example of this can be seen in Figure 2.4.
Figure 2.4: A figure of the coverage path[17]
The benefit of using a spiral, as seen in Figure 2.4, is that a spiral-shaped patrol minimizes the number of circles that are patrolled repeatedly. This solution can be scaled up to fit any room size and any sensor radius or function of the robot. In pseudo-code, the algorithm could look as follows (a small code sketch is given after the list):
1. Set the start point as the current point.
2. Mark all other points as unvisited.
3. Loop: find an unvisited neighboring point whose distance to the boundary is the smallest.
4. If no unvisited neighbor is found, mark the current point as visited and stop.
5. Otherwise, mark the current point as visited and set the current point to the neighboring point.
6. End of loop.
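The following is a minimal, illustrative sketch of that pseudo-code, assuming an axis-aligned rectangular room, a regular grid of circle centres spaced by the sensor radius, and "distance to the boundary" measured to the nearest wall. The function name and parameters are our own choices; the cited paper does not provide code.

import math

def coverage_path(width, height, sensor_radius, start=(0, 0)):
    # Grid of candidate circle centres (integer indices) covering the rectangle.
    nx = int(width // sensor_radius) + 1
    ny = int(height // sensor_radius) + 1
    unvisited = {(i, j) for i in range(nx) for j in range(ny)}

    def boundary_distance(cell):
        x, y = cell[0] * sensor_radius, cell[1] * sensor_radius
        return min(x, y, width - x, height - y)

    def unvisited_neighbours(cell):
        i, j = cell
        cand = [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)]
        return [c for c in cand if c in unvisited]

    current = start
    unvisited.discard(current)
    path = [current]
    while True:
        nbrs = unvisited_neighbours(current)
        if not nbrs:                                 # no unvisited neighbour: stop
            break
        current = min(nbrs, key=boundary_distance)   # greedily hug the boundary
        unvisited.discard(current)
        path.append(current)
    return [(i * sensor_radius, j * sensor_radius) for i, j in path]

if __name__ == "__main__":
    for x, y in coverage_path(width=6.0, height=4.0, sensor_radius=1.0):
        print(f"({x:.1f}, {y:.1f})")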
2.5.3 Path Planning using Reinforcement Learning and Neural Networks
When using a robot to patrol a large area, there are often specific points that must be inspected and surveyed [27]. The travel between these points is often through areas where surveillance is less necessary, meaning that minimizing these distances and travel times leads to a higher level of security.
Finding the optimal route between these points is simple when the number of points is low, but the underlying problem is nondeterministic polynomial (NP). NP refers to: "A decision problem (a problem that has a yes/no answer) is said to be in NP if it is solvable in polynomial time by a non-deterministic Turing machine. Equivalently, and more intuitively, a decision problem is in NP if, if the answer is yes, a proof can be verified by a Turing machine in polynomial time." The task becomes increasingly difficult as the patrolled area and the number of required points grow, since the number of possible routes increases drastically. To find the
optimal route for large facilities, companies have begun to use reinforcement
learning and neural networks to plan this path. This is done by representing the distances and times between points as costs and treating arrival at
the points as a reward. Using this method, the software iteratively finds a close-to-optimal path by minimizing the distance traveled and maximizing the time spent at the desired locations. This method has been shown to outperform other methods and can find a nearly optimal path with low computational expense for up to 100 different patrol points.
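To illustrate the cost/reward formulation described above, the sketch below runs tabular Q-learning over a small distance matrix to order a handful of patrol points. It is only a toy example under our own assumptions (state = current point plus the set of visited points, reward = negative travel distance); the cited work uses neural networks to scale to far larger instances.

import random

def learn_patrol_route(dist, episodes=5000, alpha=0.1, gamma=0.95, eps=0.2):
    n = len(dist)
    q = {}  # Q[(current_point, frozenset_of_visited)] -> list of values per next point

    def q_row(state):
        return q.setdefault(state, [0.0] * n)

    for _ in range(episodes):
        current, visited = 0, frozenset([0])
        while len(visited) < n:
            options = [p for p in range(n) if p not in visited]
            row = q_row((current, visited))
            nxt = random.choice(options) if random.random() < eps \
                else max(options, key=lambda p: row[p])
            reward = -dist[current][nxt]              # travel cost as negative reward
            new_visited = visited | {nxt}
            next_options = [p for p in range(n) if p not in new_visited]
            future = max((q_row((nxt, new_visited))[p] for p in next_options), default=0.0)
            row[nxt] += alpha * (reward + gamma * future - row[nxt])
            current, visited = nxt, new_visited

    # Greedy rollout of the learned policy.
    route, visited, current = [0], {0}, 0
    while len(visited) < n:
        options = [p for p in range(n) if p not in visited]
        row = q.get((current, frozenset(visited)), [0.0] * n)
        current = max(options, key=lambda p: row[p])
        visited.add(current)
        route.append(current)
    return route

if __name__ == "__main__":
    dist = [[0, 2, 9, 10],
            [2, 0, 6, 4],
            [9, 6, 0, 8],
            [10, 4, 8, 0]]
    print(learn_patrol_route(dist))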
2.5.4 State of the Art Conclusion
The research done on the existing commercial products has given an overview of
what has already been produced. This gives an idea of how saturated the market is
and also how well a similar product would do in terms of sales. This information
can be used to avoid redundant solutions that would likely result in a financial
loss. Furthermore, the specifications and functionalities of the existing solutions
also give an understanding of what holes in the market may be present. The final
takeaways from this section are related to the movement function of the robot.
Although this aspect of the robot will not be the main focus of this project, the
information can be taken into consideration when designing the robot, and for a
fully fledged patrol planning and possibly robot swarm system in the future.
2.6 Sensor Possibilities
For the solution to be able to detect intruders and traverse its surroundings, it
will need several sensors. These sensors can be divided into two groups: visual and audio.
2.6.1 Audio Sensors
Dynamic Microphones
Dynamic microphones use an induction coil in a magnetic field to record sound.
This recording method makes dynamic microphones cheap and durable; two things
that are attractive when designing a mobile, commercial solution [25].
Diaphragm Condenser Microphones
Diaphragm condenser microphones use a capacitor to convert vibrations into electrical current, making the microphone highly sensitive. This could be particularly
useful in cases where the source of a sound is quiet or distant, such as glass breaking or footsteps from the other side of a building. Many diaphragm condenser microphones also allow the user to choose the desired polar pattern. This allows for
the use of an omnidirectional polar pattern, giving the microphones the ability to listen in all directions (360°). However, the price of these microphones is higher than
that of a dynamic microphone [25].
2.6.2 Visual Sensors
Cameras used for image processing systems are usually categorized as either industrial/machine vision (MV) cameras or network/IP (Internet Protocol) cameras,
and both have their benefits and disadvantages.
Network Cameras
Network cameras are frequently used in surveillance applications and sometimes
in combination with industrial cameras. These are typically placed in robust casings designed to withstand harsh weather and jolts, making them suitable for outdoor and indoor use. They usually have a variety of day and night modes and
infrared filters that deliver high image quality consistently, even under poor lighting and weather conditions. These cameras compress the images they record to
reduce the volume of data being transmitted over the network. These cameras,
when connected to a network, can theoretically have an unlimited number of users
access the feed at the same time [5].
Industrial Cameras
Industrial cameras send raw, uncompressed data directly to the computer to which they are connected. This computer is then responsible for processing a large volume of incoming data. The benefit of this is that no image data is lost in compression. Industrial cameras are usually divided into two categories: line scan and
area scan cameras. These are relevant in different computer vision applications
and capture images differently.
Line scan Cameras
Line scan cameras use image capture sensors arranged in one or a few lines of pixels, where the image is captured line by line and finally assembled into a complete image in the processing stage. Line scan cameras are typically used
for scanning objects that move in front of the sensor, for example on a high-speed
conveyor belt. These cameras are used in many printing, packaging, and surface
inspection industries.
Area scan Cameras
Area scan cameras use rectangular image-capturing sensor arrangements, where
the entire image is captured simultaneously. These cameras are found in many
industries, such as medical, traffic, and security[5].
Colour Camera
Most color cameras work by having a single CMOS or CCD sensor overlaid with
colored filters that cover each of the pixels, making the pixels alternate between
being sensitive to red, green, and blue. The mosaic pattern typically used for this
is called the Bayer pattern. The resulting mosaic contains twice as many green
pixels compared to blue or red, because this mimics the human eye's greater sensitivity to green light. The Bayer pattern is illustrated in Figure 2.5.
Figure 2.5: The Bayer pattern, used in most color cameras[30].
This also means that if a red light hits a cell that is sensitive only to green light,
that information will be lost. Lost information can, through a variety of different algorithms, be interpolated from adjacent cells. This process of combining cell
information to make an image is called demosaicing [30]. This demosaicing conversion from Bayer pattern to RGB is very CPU intensive and is usually done by
the FPGA (Field Programmable Gate Array) of a frame-grabber instead, which is
an electronic component that can carry out this conversion. Single-sensor color
cameras have the advantage that the electronics are identical to a monochrome
camera, where only the sensor has to be modified with color filters, making them
very inexpensive and popular[19].
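As a minimal illustration of demosaicing in software (rather than on a frame-grabber FPGA), the sketch below converts a raw Bayer-pattern frame to a colour image with OpenCV. The file name, resolution, and the assumed BG Bayer layout are illustrative; the actual layout depends on the sensor.

import cv2
import numpy as np

raw = np.fromfile("frame.raw", dtype=np.uint8)   # hypothetical raw sensor dump
raw = raw.reshape((480, 640))                    # assumed sensor resolution
bgr = cv2.cvtColor(raw, cv2.COLOR_BayerBG2BGR)   # interpolate the missing colour values
cv2.imwrite("frame_rgb.png", bgr)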
Monochrome Camera
Monochrome cameras might not seem like the best type of camera for image recognition, as the images they output do not contain any color, only intensity gradients. This is, however, also the monochrome camera's benefit; while it cannot capture any color, it can capture all the light hitting the sensor, which results in a better-quality image with more detail. Furthermore, no demosaicing is needed to create the final image. Many image processing techniques also involve gray-scaling, which is the process of taking a color image and transforming it into an image consisting only of shades of gray. Using a monochrome camera would eliminate this process entirely. In addition to eliminating the need for a gray-scaling step, the image output of a monochrome camera is also significantly smaller in byte size than that of its color counterpart. Since the output images are smaller, they require less processing time. Processing time is an important factor to consider if the goal is to run image processing in real time. Monochrome cameras also have better low-light performance, since they are able to take in more light per photocell; this is a benefit, as there is a chance that the robot will be deployed in a poorly lit or unlit environment[12].
Thermographic Camera
A thermographic or infrared camera, as the name implies, is a camera that creates an image using infrared radiation. This means that it is able to see heat signatures. Thermographic cameras are usually very expensive compared to their
non-thermographic counterparts. A reason for using a thermographic camera is
that they are able to see in total darkness. As it is possible that the robot might
be deployed in a non-lit environment, a thermographic camera might be the only
camera type capable of detecting a human intruder.
On Image Size and Processing Speed
As mentioned above, images are often grayscaled in image processing applications
as a method of reducing the amount of data. This is advantageous since it decreases
the processing time of the image.
Another way of decreasing the processing time is to decrease the image quality.
In some applications, high image resolution is not needed, meaning the image
resolution can be decreased without a significant drop in program functionality.
In other applications, a high frame rate may not be needed. If each frame has to
be analyzed, the frame rate is a key factor in the maximum permissible processing
time. Thus, decreasing the frame rate will increase the amount of time that can be
used for image processing.
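The sketch below combines the three data-reduction steps mentioned above (gray-scaling, lowering the resolution, and lowering the effective frame rate) for a video feed, assuming OpenCV video capture. The input file, scale factor, and frame-skip value are illustrative choices, not values used by the actual system.

import cv2

cap = cv2.VideoCapture("patrol_feed.avi")   # hypothetical recorded feed
frame_skip = 3                               # analyze only every 3rd frame
count = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    count += 1
    if count % frame_skip:
        continue
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # drop colour information
    small = cv2.resize(gray, None, fx=0.5, fy=0.5)    # halve the resolution
    # ...motion detection or other image processing on `small` would go here...
cap.release()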
2.7 Subconclusion & Problem Formulation
There are many reasons to use alternatives to either supplement or partly replace
security guards in property security. This alternative would have to be capable
of locating potential intruders to be effective. The different alternatives to using
security guards have their advantages and disadvantages. In this project, a mobile robot platform is chosen as the security solution, due to its increased flexibility. Additionally, this project will focus on the perception
system of this solution, as this is a more complex matter. From this, the following
problem formulation is formed:
How can a perception system be made to detect humans
for a mobile robotic platform?
The system will utilize both sound and visual sensors, since this has been deemed a better approach according to section 2.3.4. When specifically considering the sound sensors, these will focus on detecting sound anomalies (sounds of
interest) from the environment.
In this report, a prototype will be constructed and validated by a use case described
in chapter 7. The results of this validation will be discussed in Chapter 8. Here, the
solution will be compared to both current robotic solutions, along with the use of a
standard human security team, to determine the validity of the proposed product.
Chapter 3
Requirements
This chapter will outline the requirements of the solution, which will be the groundwork for the actual system design. All these requirements will be design requirements intended to showcase the desired functionalities of the system. These will
later be addressed and converted into functional requirements with measurable
success criteria.
1.1 ISO Compliance: The robotic system shall be in compliance with all relevant ISO standards.
1.2 Autonomous Navigation: The robot must be able to navigate its designated known environment without human interference. This will include navigating between some predetermined points.
1.3 Obstacle Avoidance: The robot must avoid and navigate around 95% of the objects in its chosen path.
1.4 Positional Awareness: When traveling in a known environment, the robot must know its current approximate position.

Table 3.1: General requirements of the mobile robot platform
The general system requirements of the robotic mobile platform are outlined in
table 3.1. These requirements will not be directly addressed in this report, but they
do affect the requirements of the perception system.
The requirements of the perception system are outlined in table 3.2. These requirements are the main groundwork for the solution concept. These requirements will
be further addressed in Section 4.4.2, which will outline success criteria based on
the chosen solution. Anomalies are defined in section 5.1.
2.1 Anomaly Detection: When an environmental anomaly occurs within range (of the robot's sensors), the robot must detect the anomaly.
2.2 Anomaly Classification: When an anomaly is detected, the robot must identify if the anomaly is a human.

Table 3.2: Requirements for the perception system
Chapter 4
Design Concept
This chapter will cover the possible design options for the microphone array and camera setup. Only microphones and cameras will be considered as relevant possibilities, as no other reasonable sensor types have been
identified. Additionally, the possibility of using multiple sensor types will be considered. Lastly, the chosen designs will be combined into a holistic design concept
in section 4.4.
4.1 Camera Setups
This section will cover single camera setups and camera arrays. These setups can
be combined with any of the camera types mentioned in section 2.6.2.
Single Camera
The single camera is the simplest setup; its main benefits are ease of use and setup. No calibration is needed, as there is only one input in this setup.
Another advantage is decreased production cost. There are, however, some limits
when using a single camera, such as the resolution and frame rate being limited to
what the camera is able to output directly. However, as mentioned in section 2.6.2,
this quality may not be needed. Using a single camera also limits the field of view
of the optical system to that of the single camera. This can be detrimental when
used in solutions where the surroundings of the robot are of concern. However,
many stationary surveillance cameras use lenses, like the fisheye lens, to increase
their field of view to be able to cover additional areas.
Figure 4.1: An example of the increased field of view that a fisheye lens provides [31].
Camera array
A camera array is a collection of cameras calibrated to produce a single image.
Usually, individual cameras are of lower cost, but with calibration and software
processing, it is possible to combine the lower-quality images into high-quality images or, alternatively, a picture of a larger field of view than a single camera could
achieve. This can be very useful if there is a need for an image of all directions in a
given environment. Though the setup of the camera array is relatively simple, the
calibration and software processing of the array is out of scope for a project such
as this, due to its complexity.
Figure 4.2: An example of an image captured on a camera array, spliced from multiple images.[44]
4.2 Sound localization
Sound localization is a field within signal processing that deals with identifying
the origin of a detected audio signal, with respect to an array of microphones[32].
The ability to estimate the direction of a sound is vital to many biological organisms, where it serves as an alert to dangers and predators or, in predators, is used
to locate prey. Sound localization also has many different engineering applications
and has become a large and complex field, in which humans attempt to recreate artificially that which the animal kingdom has perfected[36]. Sound localization is an
important field that has seen many different applications such as sound source separation, sound tracking, and speech enhancement technologies. In robotics, it can
be useful to be able to determine the direction of and distance to a sound source,
especially in social- or security robotics. This section will explore and analyze
the advantages and disadvantages of some existing sound localization methods to
determine which could prove the most viable in a mobile security robot.
Basic Principles
Typically, sound localization in electronic systems is done by using two or more
microphones in an array and using the difference in arrival times of a sound at the
two microphones to determine the direction of arrival (DOA). This time difference
is called the interaural time difference[32]. The accuracy of a microphone array’s
ability to determine direction is fundamentally limited by the physical size of the
array. If the microphones in the array are too closely placed together, the interaural
time difference will be near zero, making mathematical estimation of direction
extremely difficult. It is not uncommon for the microphones in such arrays to be placed 10-30 centimeters apart, which has consequences for the size of the array[36]. Physically large arrays can become impractical to use on small robots, and even for large robots, such microphone arrays can be inconvenient to mount and maneuver. Large separation between microphones is required to detect the low-frequency content of audio signals, but small distances are required to avoid spatial aliasing. This poses another challenge when designing microphone arrays, as the spacing between microphones is not arbitrary. Spatial aliasing will be further elaborated upon in section 5.2.5. The precision of sound localization using microphone
arrays has been found to increase with the use of more microphones, which in turn
increases the cost of the array[32]. This is an example of some of the problems that
are encountered in the physical setups of sound localization microphone arrays.
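As a minimal illustration of the arrival-time-difference principle, the sketch below estimates the delay between two microphone signals with a full cross-correlation; the peak of the correlation gives the sample lag. The simulated pulse, sampling rate, and delay are made up for the example and are not measurements from the actual system.

import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    # Full cross-correlation between the two signals; the index of its peak,
    # relative to the zero-lag position, gives the delay in samples.
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)
    return lag / fs   # arrival-time difference (t_a - t_b) in seconds

if __name__ == "__main__":
    fs = 48000
    t = np.arange(0, 0.05, 1 / fs)
    pulse = np.sin(2 * np.pi * 1000 * t) * np.exp(-200 * t)
    delay = 12                                         # simulated extra travel, in samples
    mic_a = np.concatenate([pulse, np.zeros(delay)])   # sound reaches mic A first
    mic_b = np.concatenate([np.zeros(delay), pulse])   # ...and mic B 12 samples later
    print(estimate_tdoa(mic_a, mic_b, fs))             # ≈ -12 / 48000 = -0.00025 s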
Monoaural Localization
Monoaural localization refers to the use of a single "ear" or microphone to determine the direction of a sound source. As mentioned previously, sound localization
in artificial systems is typically done by using two or more microphones. In contrast, being able to use a single microphone holds the potential to decrease both
the size and cost of a microphone array significantly[36]. Sound localization with
a single microphone, however, is very inaccurate and complex because it requires
prior knowledge of possible sounds, and in a narrow mathematical sense, it is
actually impossible to determine the direction of sounds with a monoaural recording alone[36]. To combat this, monoaural microphone arrays typically use artificial
pinnae or auricles, which refer to the outer part of the ear in animals, as can be seen in figure 4.3.
Figure 4.3: An illustration showing how sound from different directions is affected by an artificial
pinna structure [46].
The pinna is able to change the way a known sound is perceived and change
the spectral shape of the sound depending on the direction it is coming from.
Humans are automatically trained to recognize this change throughout their lives
and become better at sound localization of known sounds through exposure, but
recreating this artificially is very difficult. Some studies have suggested using machine learning to train an algorithm to be able to do this reliably with relative
success[36]. This sound localization setup, however, has proven to have some challenges[37]. Firstly, the use of an artificial pinna means that there exists a range
of angles from which the algorithm has difficulty estimating the direction of the
source of the sound. In a Stanford University test [37], this problematic angle
ranged from 235◦ to 345◦ , which constitutes nearly a third of the possible sound
directions. Additionally, the average error of the experiments ranged from 4.3◦ for
wideband noise-like signals to 18.3◦ for naturally occurring sounds such as dog
barks.
Assuming that the audio source is constant and stationary, it is possible to perform monoaural audio localization by moving the microphone array. By introducing movement to the microphone array, it is possible to emulate having multiple
microphones in the array, because the changing position yields readings from different positions relative to the audio source. Mathematically, assuming accurate position estimation, this is indistinguishable from the sound source localization arithmetic used for estimating direction in multi-microphone arrays, except that the travel time between samples must be compensated for.
Figure 4.4: Example of monoaural localization with movement
Binaural Localization
Binaural literally means "having or relating to two ears" and binaural localization
refers to sound localization by using two microphones. Binaural localization primarily uses interaural time differences as a cue for sound localization. This is a
phenomenon used in most mammals to determine sound direction in the azimuth
plane. However, this binaural cue cannot be used to determine the elevation of a
sound source as it suffers from front/back ambiguity[36]. A sound source placed
directly in front of a binaural microphone array is indistinguishable from a sound
source placed directly behind the array, as the interaural time difference is zero
in both cases. Much like the monoaural microphone arrays, binaural microphone
arrays can use artificial pinna to distort and reflect known sounds depending on
the sound source’s direction, making it possible to more accurately determine the
direction of the sound source in three-dimensional space. This is done using a
Head-Related Transfer Function which also takes into consideration the shape of
the head when calculating the time it takes the sound to reach the furthest microphone [26].
The ambiguity of this solution can be avoided by using more than two microphones. However, more microphones come with more complex algorithms and a
need for larger computational capacity. By using more than two microphones, while still keeping their number to a minimum, it is possible to reduce the complexity of both the algorithm and the computation [32]. Tests of binaural microphone arrays were made using three different methods.
Multi-microphone array
Multi-microphone arrays consist of three or more microphones, generally for the purpose of increasing accuracy and removing the need for pinnae.
This does however come with an increased cost and complexity; both arising from
the increased number of microphones. While adding more microphones generally
will add a higher level of accuracy, it will also increase the costs of the system,
both directly, through having to buy more microphones, and indirectly, through
needing more processing power and more complex code[22].
A multi-microphone array uses the same fundamental principle for sound localization as a binaural microphone array, using the interaural time difference[22].
To determine the direction of a sound in 3D space, the minimum required number of microphones is four. These microphones should be set up as the vertices of a tetrahedron. This setup eliminates front-back ambiguity across all planes. If the elevation of the sound is not needed, sound localization can be achieved using only three microphones forming a triangle parallel to the azimuth plane. Some high-end microphone arrays use ultradirective microphones
arranged on a sphere, which allows for a very robust sound localization setup.
One of these types is the Eigenmike®, which is a spherical 32-microphone array
that is able to detect and isolate multiple sound sources with selective hearing.
4.3 On the use of multiple sensors
The main reason for using multiple sensors is that it allows an anomaly detected by one sensor to be confirmed by another. When only using audio, there is a significant chance of false positives when detecting sound anomalies that are not intruders. Additionally, there is no way to locate an intruder if the intruder is not making enough sound. When only using video, the surveyable area is limited by the camera's field of view. As such, the camera has
difficulty covering a large area simultaneously. However, when using video and
audio in conjunction, the severity of these problems is reduced. A solution using
this multisensor method can use one detection method as a confirmation of the
anomaly detected by the other.
4.4 Concept Selection

4.4.1 Microphone Selection
Because of the inherent inaccuracy of audio source localization using a single-microphone setup, this option can be discarded as unviable. The binaural microphone array requires the use of pinnae, which is reason to discard this option as well, due to the needlessly increased complexity. Based on the information available, a 3-microphone array has been determined to be the most suitable solution for the implementation needs of this project. Fewer microphones mean less space and fewer resources required, along with decreased complexity. Using a 3-microphone setup also removes the main issue with a 2-microphone setup, namely that calculations using the TDOA method give two potential anomaly source locations. It should be noted that using only three microphones will reduce the accuracy of the system compared to a system using 8 microphones [29]. Three microphones also limit sound source detection to the azimuth plane, meaning the system cannot determine whether the sound is above or below the microphone setup.
In this project, the three microphones are positioned on the points of an equilateral
triangle to ensure equal sensitivity to all angles in the azimuth plane.
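Once a time difference has been estimated for a microphone pair, the far-field (plane-wave) approximation sin(θ) = c·Δt/d converts it into a direction of arrival. The sketch below shows this conversion for a single pair; the spacing and speed of sound are illustrative values, and the full three-microphone geometry is not covered here.

import math

def doa_from_tdoa(tdoa, mic_spacing, speed_of_sound=343.0):
    # Clamp to [-1, 1] so measurement noise cannot push asin out of its domain.
    s = max(-1.0, min(1.0, speed_of_sound * tdoa / mic_spacing))
    return math.degrees(math.asin(s))      # angle relative to the array broadside

print(doa_from_tdoa(tdoa=0.00025, mic_spacing=0.2))   # ≈ 25.4 degrees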
Camera Selection
For the camera selection, there are many options. In motion detection and object recognition, webcams or Kinect cameras are usually used; however, the expected methodologies (explained in Section 5.3) only require the use of an RGB camera. This means that the larger financial investment of a Kinect camera array is unnecessary. The lower cost of an RGB camera gives the robotic solution a decreased
production cost, making the solution more financially viable, and a more realistic
option for smaller companies with less financial flexibility.
4.4.2 Functional Requirements
In this section, functional requirements will be outlined based on the requirements of Chapter 3 (see Table 4.1). These requirements follow the format of an identifying number, a title, and a description. The description also states the original requirement upon which the functional requirement is based.
3.1 Precision of Sound Anomaly Detection: When a sound anomaly occurs within 5 meters of the perception system, the robot must detect the anomaly with 90% precision. Based on Requirement 2.1.

3.2 Recall of Sound Anomaly Detection: When a sound anomaly occurs within 5 meters of the perception system, the robot must detect the anomaly with 90% recall. Based on Requirement 2.1.

3.3 Sound Anomaly Identification: When a sound anomaly is detected, the perception system must identify the direction of the sound with a deviation within half the FOV of the camera minus 10° in the azimuth plane. This should ensure that any movement is in line of sight of the camera. Using this deviation, the sound source localization must be able to identify the anomaly angle within the deviation 90% of the time. Based on Requirement 2.1.

3.4 Human Identification in well-lit environment: When a human is seen on the camera feed, the perception system must correctly identify the human 95% of the time in a well-lit environment. Based on Requirement 2.2.

3.5 Human Identification in poorly-lit environment: When a human is seen on the camera feed, the perception system must correctly identify the human 80% of the time in a poorly-lit environment. Based on Requirement 2.2.

Table 4.1: Functional Requirements of the robotic system
Chapter 5
Methodology
This chapter will cover different methods for sound anomaly detection, sound source localization using one, two, or multiple microphones, and motion detection. It also states which methods were chosen for anomaly detection, sound source localization, and motion detection in this project.
5.1 Anomaly Detection
Anomalies are defined as something that deviates from what is normal or expected,
which in relation to this report can be sounds of windows breaking, forced entry,
or video feeds of suspected burglars. Automatic video and audio analysis can detect anomalous patterns in surveillance feeds; this is called anomaly detection[28].
Anomaly detection is very useful, as it can serve as a pre-alarm or a signal to security personnel that they should monitor a certain camera feed. This significantly
increases the amount of surveillance a single person can perform[35].
5.2 Audio Analysis
Audio analysis in particular has emerged as a relevant tool for improving the security of public and private assets. In fact, in many cases, the analysis of audio
signals from microphones in a surveilled area deployed to detect anomalous audio
signatures has been proven to be more reliable than the video analysis counterpart
of the same area[10].
Audio analysis refers to the extraction of information and meaning from audio
signals for analysis, classification, and storage. Audio analysis extracts data that
represents analog sounds in digital form, preserving the main properties of the original sound. Sounds have three key characteristics to consider when analyzing: time period, amplitude, and frequency. Audio signals are most commonly represented visually as a waveform, a spectrum, or a spectrogram, as seen in Figures 5.1, 5.2, and 5.3.
Waveform
The waveform is the most common way of representing sound and is encountered
in most recording software. The waveform representation is a graph that maps the
amplitude of a sound over time.
Figure 5.1: A visual representation of the waveform of a signal[24].
Spectrum
A spectrum representation shows the frequency content of a sound signal, where
the frequency is represented along the x-axis and the amplitude of the signal is on
the y-axis. Natural sounds contain a wide range of different frequencies. Tonal
sounds contain a fundamental frequency and a range of overtones that are multiples of the fundamental frequency. Usually, it isn't possible to hear the individual overtones, as the fundamental frequency and the overtones combine; their individual amplitudes and the relationship between them play an important role in the perceived tone color, or timbre, of the tone. The tone color is what makes it possible to distinguish between the same tone originating from different sources. These fundamental tones and overtones are visualized in the spectrum representation in Figure 5.2.
Figure 5.2: The spectrum of a sound signal, with fundamental frequency and overtones marked[24].
Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies as it varies
over time. Spectrograms are also known as sonographs or voiceprints. Spectrograms are used extensively in the field of audio analysis with many different applications. As seen in Figure 5.3, the spectrogram is usually represented as a heat
map, where the intensity is shown by varying colors in the image.
Figure 5.3: A spectrogram of a sound signal[24].
Visual representations are rarely sufficient to extract meaningful information
about an audio signal, and a numerical approach is necessary. In almost any audio
analysis, the Fourier transform plays a large role.
5.2.1 Fourier Transform in Audio Analysis
The Fourier transform (FT) is a mathematical transform that decomposes a function into its frequency components, producing an output that is a function of frequency. It typically maps a signal from the time or space domain to the frequency domain and vice versa. This is useful in many types of signal processing, as it allows one to isolate certain frequencies in a signal to suppress, enhance, or analyze them. An example of this particular application of the Fourier transform is a signal with high-pitched noise: the high pitches can be isolated and suppressed, and the inverse Fourier transform can then reproduce the signal without the noise.
In short, the Fourier transform makes it possible to view a signal as the sum of
several pure sine waves of different frequencies and amplitudes. An example of
the Fourier transform output can be seen in Figure 5.4.
Figure 5.4: Visualisation of the output of the Fourier transform[6].
One of the common conventions for defining the Fourier transform of some integrable function f : ℝ → ℂ is the following:

\[
\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-i 2\pi \xi t}\, dt, \qquad \forall\, \xi \in \mathbb{R}
\]
The Fourier transform works by mathematically winding the graph of f around the origin of a Cartesian coordinate system at a variable frequency ξ, referred to as the winding frequency. The wound graph is assigned a point denoting its center of mass. When the winding frequency approaches the frequency of f, or the frequency of one of the components of f, the center of mass becomes noticeably displaced from the origin along the x-axis. The displacement of this point from the origin is mapped to another graph as a function of the winding frequency; this function is the output of the Fourier transform. In figure 5.5, the original signal
with frequency 3 can be seen in yellow, and the corresponding windings at different winding frequencies can be seen below. The x-coordinate position of the
center of mass can be seen as a function of winding frequency in the red graph.
It can be seen that the displacement is significant at exactly the frequency of the
signal. This is what makes the Fourier transform able to decompose signals into
their components. This also works for composite signals, where the displacement
peaks would be found at the frequencies of the components.
Figure 5.5: Visualization of how the Fourier Transform works[2].
A digital computer cannot work with continuous-time signals directly, so it is necessary to take samples and analyze these instead of the original signal. This yields a discrete sequence of samples, sampled at a frequency chosen according to the Nyquist-Shannon theorem (Section 5.2.3). The Discrete Fourier Transform (DFT) is the discrete version of the Fourier transform; it transforms a discrete sequence, such as a sequence of samples, from the time-domain representation to the frequency-domain representation. In practice, the DFT is usually computed with the Fast Fourier Transform (FFT), which is any efficient algorithm for computing the DFT or its inverse. A fundamental flaw of the discrete Fourier transform is that it is computationally intensive, as it requires O(n²) computations. The fast Fourier transform, however, uses clever mathematics to reduce this to O(n log n) computations, meaning that a Fourier computation that would have taken over 3 years with the plain DFT could be done in 35 minutes with the FFT.
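As a brief illustration of this decomposition (not part of the project's code), the following NumPy sketch builds a signal from two sine components with arbitrarily chosen frequencies and recovers them with the FFT:

```python
# Illustrative sketch: decomposing a composite signal with the FFT (NumPy).
# The component frequencies and amplitudes below are arbitrary examples.
import numpy as np

fs = 1000                                   # sampling rate [Hz]
t = np.arange(fs) / fs                      # one second of samples
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)              # FFT of the real-valued signal
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
magnitude = np.abs(spectrum) / len(signal)  # normalized amplitude per bin

# The two largest peaks sit at the component frequencies (50 Hz and 120 Hz)
peaks = freqs[np.argsort(magnitude)[-2:]]
print(np.sort(peaks))                       # [ 50. 120.]
```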
5.2.2 Using The Fourier Transform To Detect Sound Anomalies
To be detected, a sound anomaly needs to deviate from normal sounds. This is where the Fourier transform is useful, since it allows analysis of each individual frequency in a sample of sound. To detect a sound anomaly, a short sound sample is recorded and a Fourier transform is performed on it to split it up into its frequency components and their respective amplitudes. To then determine whether the recorded sample contains unusual sounds, a sound template has to be created that the sample can be compared to. One way of creating the template is to record a large number of background noise samples, perform Fourier transforms on them, and then save the largest value found at each frequency. The template will then consist of the loudest background noise observed at each frequency. This means that it can also be used in places where there is somewhat loud background noise while still being able to detect quieter sound anomalies.
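A minimal sketch of this template approach is shown below; the sample length, threshold factor and function names are illustrative assumptions rather than the project's actual implementation:

```python
# Sketch of the noise-template approach: keep the loudest background
# magnitude seen at each frequency, then flag samples that exceed it.
# Sample length and threshold factor are illustrative assumptions.
import numpy as np

FS = 48_000            # sampling rate [Hz]
N = 2 * FS             # two-second samples

def build_template(background_samples):
    """background_samples: iterable of length-N arrays of background noise."""
    template = np.zeros(N // 2 + 1)
    for sample in background_samples:
        template = np.maximum(template, np.abs(np.fft.rfft(sample)))
    return template    # loudest observed magnitude per frequency bin

def is_anomaly(sample, template, factor=1.3):
    """True if any frequency bin is louder than `factor` times the template."""
    return np.any(np.abs(np.fft.rfft(sample)) > factor * template)
```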
5.2.3 Nyquist-Shannon Sampling Theorem
Audio signals are continuous-time (analog) signals, which can be stored on computers in the form of discrete, equidistant points, called samples, as a function of discrete time or space. The higher the sampling rate, the higher the accuracy of the reconstructed signal and the stored information. However, high sampling rates generate large volumes of data to be stored and processed, which can require a lot of computational power to handle. Musical audio signals, for instance, are rich in high frequencies and require high sampling rates upwards of 44,100 samples per second[9], while other signals require only a fraction of this. Therefore, it is crucial
to select an appropriate sampling rate for a given signal in order to record sufficient information about the signal to recreate and analyze it. The question then
arises: What is the minimum necessary sampling frequency for a given type of signal that allows for accurate reconstruction and preservation of data? The answer
is provided by the Nyquist-Shannon sampling theorem, which states that:
"The minimum sampling frequency of a signal, so that it will not distort its underlying information, should be double the frequency of its
highest frequency component."[9]
Suppose that X(t) is a band-limited signal. Band-limited means that for the Fourier transform of this signal, \(\hat{X}(f) = \mathcal{F}\{X(t)\}\), there exists a certain \(f_{\max}\) for which

\[
|\hat{X}(f)| = 0 \qquad \forall\, |f| > f_{\max}
\]
so that there is no power in the signal beyond the maximum frequency \(f_{\max}\). The Nyquist theorem then states that to sample this signal, it would be necessary to sample with a frequency larger than or equal to twice the maximum frequency contained in the signal, that is:

\[
f_{\text{sample}} \geq 2 f_{\max}
\]
If this is the case, no information is lost during the sampling process, and the
original signal could theoretically be reconstructed from the sampled signal.
5.2.4 Aliasing
Aliasing is the effect that happens when different signals appear similar when sampled. It can also occur when the reconstructed signal from the samples is different
from the original continuous signal. An example of aliasing can be seen in figure
5.6.
Figure 5.6: An example of aliasing[9]
The blue wave is the signal being sampled, and the red bars mark where the signal is sampled. The green sine wave in Figure 5.6 is the wave reconstructed from the sampled points; the green wave is aliased. This occurs when the sampling frequency is less than two times the highest frequency of the sampled signal. As is obvious from the image, the green wave is not an accurate representation of the blue wave, which illustrates the importance of a sufficiently high sampling frequency for obtaining an accurate representation of the sampled signal.
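As a small numerical illustration (not taken from the report), sampling a tone above half the sampling frequency makes it reappear at a lower, aliased frequency:

```python
# Small aliasing demonstration: a 15 kHz tone sampled at only 16 kHz
# shows up in the spectrum as a 1 kHz alias (16 kHz - 15 kHz).
import numpy as np

fs = 16_000                                  # sampling rate [Hz] (too low)
f0 = 15_000                                  # true tone frequency [Hz]
t = np.arange(fs) / fs                       # one second of samples
samples = np.sin(2 * np.pi * f0 * t)

freqs = np.fft.rfftfreq(len(samples), d=1 / fs)
peak = freqs[np.argmax(np.abs(np.fft.rfft(samples)))]
print(peak)                                  # 1000.0, not 15000.0
```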
5.2.5 Spatial Aliasing
Spatial aliasing is a type of aliasing. It can, for example, be a problem when trying to locate a sound source using a microphone array; it occurs when the distance p between microphones in a linear setup is phase-aligned with the sound source, which happens if the wavelength of the sound source equals p. This can lead to direction ambiguity and, as such, must be addressed. The ambiguity can be addressed in one of two ways[48]; either of these two conditions needs to be met:

\[
2p < \lambda \tag{5.1}
\]

or

\[
f < \frac{v}{2p} \tag{5.2}
\]

where v is the speed of sound, λ is the wavelength of the sound, and f is the frequency of the sound.

From Equation 5.2 it is evident that, in a linear microphone setup, the maximum frequency the array can handle without ambiguity is f < v/(2p). So while spatial aliasing might not pose a problem, it is something to be aware of.
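As a brief worked example with illustrative values (not the project's array geometry), taking v = 343 m/s and a hypothetical linear spacing of p = 0.1 m, Equation 5.2 gives

\[
f < \frac{343\ \mathrm{m/s}}{2 \cdot 0.1\ \mathrm{m}} \approx 1715\ \mathrm{Hz}
\]

so such a pair of microphones could only resolve directions unambiguously for frequencies below roughly 1.7 kHz.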
5.2.6 Sampling Rate
The human ear has a best-case frequency range from 20 Hz to 20 kHz and it is very
likely that the majority of anomalous sound events can be captured entirely in this
spectrum. As stated by the Nyquist-Shannon theorem, the minimum sampling
frequency of a signal, that does not distort or lose its underlying information,
should be double the frequency of its highest frequency component. As the highest
frequency component in this spectrum is 20 kHz, it is necessary to sample at a
frequency of at least 40 kHz in order to be able to record and reconstruct the
signal. Music, for instance, is typically sampled at 44.1 kHz and recorded at 48 kHz
to leave room for the anti-aliasing filter that is used in analog-to-digital converters.
Because of this, the microphone array for this project will be sampled at 48 kHz,
which is supported by the field recorder mentioned in section 6.1.2.
5.3 Motion Detection
Motion detection is the process of detecting and tracking the movement of objects
or persons, in relation to their surroundings, in a video feed. There exists a plethora
of different methods for motion detection, and this section will outline three of the
most common methods, after which one will be chosen for use in this project.
• Background Subtraction:
Background subtraction works by comparing frames and subtracting one frame from another. Usually, a background frame is chosen, and all subsequent frames are compared to it. Compared, in this respect, means taking the absolute value of the difference between the other frame and the background frame; the absolute value is taken to avoid integer underflow, as an unsigned 8-bit integer cannot go below 0. This results in an image where the differences between the background frame and the other frames are obvious. This frame can then be thresholded to make the differences more evident. The thresholded image can then be used to draw bounding boxes around the changes between the background frame and the other frames. Though it is very simple to implement, background subtraction has its drawbacks, such as being excessively sensitive to scene changes such as lighting or other foreign events. There exist background subtraction methods that can combat these problems; they will however not be covered here.[38]
• Temporal differencing:
Temporal differencing shares some of its process with background subtraction but is still different. The main difference between the two methods is that in temporal differencing the current frame is compared with the previous frame. This makes temporal differencing better suited for a non-stationary camera, as it doesn't rely on a predetermined background for finding motion. Temporal differencing is very robust in dynamic environments, although it can suffer from poor performance in extracting all relevant feature pixels (i.e. pixels that include the features that should be identified) of the object of interest; usually, techniques such as morphological operations and hole filling are used to rectify these problems.[38]
• Optical flow:
The optical flow method can be used to detect moving objects and even their
direction and velocity. Simplified, the general technique works by tracking
pixels over a short span of time and then calculating the image derivative
and drawing a vector on it. This vector contains the direction and velocity
of the object[33]. It should be mentioned that optical flow motion detection
is computationally intensive, while at the same time being very sensitive to light and scene changes. This makes it poorly suited to a moving robot[38].
Conclusion
Based on the previous section, background subtraction has been chosen for motion detection. This is because it is easy to implement and the process is very well documented, whereas optical flow is too computationally intensive to implement on a TurtleBot. Background subtraction has its own challenges, such as its sensitivity to light and environmental changes, so while not a perfect candidate, it is the best of the three mentioned methods.
A more in-depth method for doing background subtraction

Image and video processing follow very similar procedures, the main difference being that an image is only a single frame while video processing works on a stream of images. When doing any sort of image processing with regard to motion detection, it is common practice to grayscale the input, as this can cut the processing time to a third. After grayscaling the image, it should be blurred with a Gaussian filter to smooth it. The blurring averages the pixel intensities, which is important as it smooths out high-frequency noise that could interfere with motion detection. This high-frequency noise often originates directly from the camera input; dark regions of an image usually contain most of it. The images are then compared by selecting a "first frame" and comparing it with the subsequent frames. This is done using image subtraction and thresholding to reveal regions of significant change in pixel values. A diagram of the process can be seen in Figure 5.7.
Figure 5.7: A diagram of the background subtraction process [16]
Using contour detection, it is possible to find the outlines of these regions in the thresholded image. Lastly, for easy visualization, a bounding box is drawn around the area of motion. This works well for stationary cameras, but as the robot will be moving, traditional background subtraction is not sufficient. However, there are techniques for adapting background subtraction to moving cameras. One technique is to compensate for the global motion of the camera as if the camera were stationary. This can be done using block matching. In general, block matching works by first dividing the frames of the video into blocks; the algorithm then tries to match these to the previous frame. If the algorithm finds a match, a vector is created that maps the block onto the matching block in the previous frame. In theory, this vector could be used to compensate for the global motion of the camera.
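The stationary-camera pipeline described above can be sketched with OpenCV as follows; the threshold value, blur kernel size and minimum contour area are illustrative choices, not the project's tuned parameters:

```python
# Minimal sketch of the background-subtraction pipeline described above
# (grayscale -> Gaussian blur -> absolute difference -> threshold ->
# contours -> bounding boxes). Parameter values are illustrative.
import cv2

cap = cv2.VideoCapture(0)           # webcam feed
background = None                   # "first frame" used as background

while True:
    ok, frame = cap.read()
    if not ok:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale to cut processing time
    gray = cv2.GaussianBlur(gray, (21, 21), 0)       # smooth high-frequency noise

    if background is None:          # first frame becomes the background
        background = gray
        continue

    # Absolute difference avoids unsigned-integer underflow
    diff = cv2.absdiff(background, gray)
    _, thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    thresh = cv2.dilate(thresh, None, iterations=2)  # fill small holes

    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) < 500:                 # ignore tiny regions
            continue
        x, y, w, h = cv2.boundingRect(c)             # bounding box around motion
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow("Motion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```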
5.4 Auditory localization
With regard to auditory localization, some assumptions first need to be made about the spreading of sound waves, also known as the propagation of sound, to simplify the complex nature of the subject. There are two choices to be made: firstly, near field versus far field, and secondly, free field versus diffuse field. One of each should be chosen, and the combination will simplify the complex matter of sound enough to perform calculations on it.
In the near field, it is assumed that the sound source is very close to the microphone, within one wavelength. Within this distance, the sound waves are complex
because they circulate back and forth, never escaping. This means there is no fixed
relation between pressure and distance. In the far field, the sound is assumed to be further away, between one wavelength and infinity. When the sound source is this far away from the microphone array, it is safe to treat the wavefront as planar, perpendicular to the direction of propagation [40].
The difference between free field and diffuse field is that free field assumes that
there is nothing around to reflect the sound waves back, which simplifies the calculations because there are fewer waves to account for. The diffuse field assumes
that there are walls around to reflect the sound back to the listener multiple times,
making it appear like there is no single sound source [40]. Once you have assumed either near or far field and free or diffuse field, you can choose which method to use to determine the Direction of Arrival (DOA). The most common
way of finding the DOA of sound is using the time difference of arrival (TDOA)
between a number of microphones.
TDOA is often used because of its ability to be applied to broadband signals (wide
frequency range), but also because of its accuracy and simplicity, which means that
it uses very little computational power. For these reasons, the TDOA method was
chosen to find the DOA in this project.
There are different methods of using the TDOA to find the DOA; the two most common are triangulation and steered response power (SRP). Triangulation uses the geometry of the microphone positions to calculate the DOA, while SRP is a beamforming-based method. Due to the simplicity and low computational requirements of triangulation, it was chosen for this project.
5.4.1 Determining TDOA
To determine the time difference of arrival (TDOA) of audio signals recorded from different microphones with a short relative physical displacement, it is common to use audio cross-correlation (or cross-covariance). Cross-correlation consists of the displaced dot product of two signals, and it is most commonly used to quantify the degree of similarity between two signals; in other words, it is used to compare two signals. As the sample index n of the correlator is incremented, the output of the correlator is a similarity score that compares the two signals at different time shifts. This produces two important results: how much one signal resembles the other at any given time shift, and at which time shift the peak similarity occurs. The value of the cross-correlation will be maximal when the signals are time-aligned, resulting in the greatest amount of overlap. For signals evaluated in discrete time, the correlation between two signals x and y of the same length N samples is expressed by the following expression:
\[
\mathrm{Corr}\{x, y\}[n] = \sum_{m=1}^{N} x[m] \cdot y[m+n]
\]
Once cross-correlation has been performed on two sound signals, it yields a graph of similarities at different time shifts. The index of the largest peak in the cross-correlation is usually very close, or equal, to the sampling rate of the microphones. This index is then used in the following equation:

\[
\mathrm{TDOA} = \frac{\mathrm{IndexOfMax} - \mathrm{SamplingRate}}{\mathrm{SamplingRate}} \tag{5.3}
\]

which yields a number, in seconds, that is usually very small and equal to the TDOA.[4]
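A minimal NumPy sketch of this idea is shown below. The synthetic broadband signal and the 0.5 ms delay are illustrative; the zero-lag index of NumPy's full cross-correlation (len(x) − 1) plays the role of the SamplingRate offset in Equation 5.3 for equally long recordings:

```python
# Minimal sketch of TDOA estimation via cross-correlation with NumPy.
# The signal and the 0.5 ms delay are synthetic, for illustration only.
import numpy as np

fs = 48_000                       # sampling rate [Hz]
d = 24                            # true delay in samples (0.5 ms at 48 kHz)

rng = np.random.default_rng(0)
s = rng.standard_normal(fs + 100)             # broadband "anomaly" sound
x = s[100:]                                   # reference microphone
y = s[100 - d:len(s) - d]                     # same sound arriving d samples later

corr = np.correlate(y, x, mode="full")        # similarity at every time shift
zero_lag = len(x) - 1                         # index corresponding to zero shift
tdoa = (np.argmax(corr) - zero_lag) / fs      # peak shift converted to seconds

print(f"estimated TDOA: {tdoa * 1e3:.3f} ms") # prints ~0.500 ms
```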
5.5 Farfield DOA proof
The positions of the microphones are set up in an equilateral triangle. The microphones’ positions P1 , P2 , P3 are defined in equation 5.4 and are shown graphically
in figure 5.8a.
" #
0
P1 =
,
0
#
"
− sin( π6 )
,
(5.4)
P2 = D
− cos( π6 )
"
#
sin( π6 )
P3 = D
− cos( π6 )
where D is the edge length of the equilateral triangle.
The sound wave is assumed to be from a large enough distance, so that it can be
considered as a straight line or a wall, with direction ⃗v:
"
#
sin( φ)
⃗v =
− cos( φ)
The graphic representation of ⃗v can be seen in figure 5.8b.
From here, a line representing the sound wave has to be found. Firstly, we assume that the sound wave hits P1 at time t = 0. Since the direction \(\vec{v}\) is a normal vector to the wavefront line, we can compute the wavefront line, l, as:
\[
l: \; \sin(\varphi)\cdot(x - 0) - \cos(\varphi)\cdot(y - 0) = 0
\;\;\Leftrightarrow\;\;
l: \; x\sin(\varphi) - y\cos(\varphi) = 0
\tag{5.5}
\]
This is illustrated in figure 5.9a.
In order to figure out the times at which the sound wave will hit P2 and P3 respectively, we have to know the distance between the sound wave and the points. The
distance between a line l : ax + by + c = 0 and a point P( x0 , y0 ) can be calculated
as:
\[
\mathrm{dist}(P, l) = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}
\]
(a) Positions of P1 , P2 , P3
(b) Microphone positions with ⃗v
Figure 5.8: The positions of microphones in an equilateral triangle (a), and the vector indicating
sound wave direction (b).
Inputting the values from Equations 5.4 and 5.5, we can find the distances dist(P2, l) and dist(P3, l) as:

\begin{align}
\mathrm{dist}(P_2, l) &= \frac{\left|-\sin(\varphi)\cdot D\sin(\frac{\pi}{6}) + \cos(\varphi)\cdot D\cos(\frac{\pi}{6}) + 0\right|}{\sqrt{\cos^2(\varphi) + \sin^2(\varphi)}} \tag{5.6} \\
&= D\,\left|-\sin(\varphi)\sin\!\left(\tfrac{\pi}{6}\right) + \cos(\varphi)\cos\!\left(\tfrac{\pi}{6}\right)\right| \tag{5.7} \\
&= D\,\left|\cos\!\left(\varphi + \tfrac{\pi}{6}\right)\right| \tag{5.8}
\end{align}

and

\begin{align}
\mathrm{dist}(P_3, l) &= \frac{\left|\sin(\varphi)\cdot D\sin(\frac{\pi}{6}) + \cos(\varphi)\cdot D\cos(\frac{\pi}{6}) + 0\right|}{\sqrt{\cos^2(\varphi) + \sin^2(\varphi)}} \tag{5.9} \\
&= D\,\left|\sin(\varphi)\sin\!\left(\tfrac{\pi}{6}\right) + \cos(\varphi)\cos\!\left(\tfrac{\pi}{6}\right)\right| \tag{5.10} \\
&= D\,\left|\cos\!\left(\varphi - \tfrac{\pi}{6}\right)\right| \tag{5.11}
\end{align}

These distances are illustrated in figure 5.9b.
The last important step when calculating the distance to the sound wave is to
figure out whether or not the point is "behind" or "ahead of" the wave (i.e. whether
the point has already been hit at time t = 0 or not). For this the vector projection
formula can be used:
\[
\vec{a}_{\vec{b}} = \frac{\vec{a}\cdot\vec{b}}{\|\vec{b}\|^2}\,\vec{b}
\]
(a) The soundwave at t = 0
(b) Distance from soundwave to P2 and P3
Figure 5.9: A blue line indicating the sound wave (a), and the distances from the sound wave to the
points P2 and P3 as red dotted lines (b).
What can be gathered from this equation is that the factor \(\frac{\vec{a}\cdot\vec{b}}{\|\vec{b}\|^2}\) determines whether \(\vec{a}\) is "behind" or "ahead of" \(\vec{b}\). If the factor is less than 0, \(\vec{a}\) is "behind" \(\vec{b}\), and vice versa. As a mathematical representation, the distance will be negative when "behind" \(\vec{v}\) and positive otherwise.
Applying this to the vectors P2 and P3 and \(\vec{v}\), it can be seen that:

\[
\frac{P_2 \cdot \vec{v}}{\|\vec{v}\|^2} = D \cos\!\left(\varphi + \frac{\pi}{6}\right) \tag{5.12}
\]

and

\[
\frac{P_3 \cdot \vec{v}}{\|\vec{v}\|^2} = D \cos\!\left(\varphi - \frac{\pi}{6}\right) \tag{5.13}
\]
Comparing equations 5.6 and 5.12, and equations 5.9 and 5.13, we can gather that:

\[
\frac{P_i \cdot \vec{v}}{\|\vec{v}\|^2} = \mathrm{dist}(P_i, l), \qquad i = 2, 3 \tag{5.14}
\]
It can also be gathered that the signed distances d2 and d3 are:

\[
d_i =
\begin{cases}
-\mathrm{dist}(P_i, l), & \dfrac{P_i \cdot \vec{v}}{\|\vec{v}\|^2} < 0 \\[2ex]
\;\;\,\mathrm{dist}(P_i, l), & \dfrac{P_i \cdot \vec{v}}{\|\vec{v}\|^2} \geq 0
\end{cases}
\qquad i = 2, 3
\tag{5.15}
\]

Using the information from Equations 5.14 and 5.15, it can be seen that:

\[
d_2 = D \cos\!\left(\varphi + \frac{\pi}{6}\right)
\]
and

\[
d_3 = D \cos\!\left(\varphi - \frac{\pi}{6}\right)
\]

Using this and assuming that the sound wave moves at speed c, it can be seen that the TDOA of P1 to P2, t12, is:

\[
t_{12} = \frac{D}{c} \cos\!\left(\varphi + \frac{\pi}{6}\right) \tag{5.16}
\]

and the TDOA of P1 to P3, t13, is:

\[
t_{13} = \frac{D}{c} \cos\!\left(\varphi - \frac{\pi}{6}\right) \tag{5.17}
\]

Using these two, the TDOA of P2 to P3 can be calculated as:

\[
t_{23} = t_{13} - t_{12} = \frac{D}{c} \cos\!\left(\varphi - \frac{\pi}{6}\right) - \frac{D}{c} \cos\!\left(\varphi + \frac{\pi}{6}\right) = \frac{D}{c} \sin(\varphi) \tag{5.18}
\]
It is important to note that the goal is to find the angle the sound wave came from and not the angle it is going. If we define a vector \(\vec{w}\) as a unit vector pointing towards the sound source, it will have an angle θ from the positive y-axis (see figure 5.10).
Figure 5.10: Direction towards the sound source, \(\vec{w}\)
Here it can be clearly seen that θ = φ.
Assuming that θ ∈ (−π; π], it can be seen that Equations 5.16, 5.17 and 5.18 will always admit multiple solutions, since for a sine wave:

\[
\sin(a) = \sin(\pi - a), \qquad \sin(a) = \sin(2\pi + a)
\]
From this, it can be gathered that, for a given TDOA between mics 2 and 3, t23, there would be two possible angles α1 and α2, of which one is equal to θ. α1 can be calculated as:

\[
\alpha_1 = \arcsin\!\left(\frac{c \cdot t_{23}}{D}\right) \tag{5.19}
\]
Using the sine identities, it can be seen that:

\[
\alpha_2 =
\begin{cases}
\;\;\,\pi - \alpha_1, & \alpha_1 \geq 0 \\
-\pi - \alpha_1, & \alpha_1 < 0
\end{cases}
\]
Using this, two possible values for θ are found. These values can be inserted into
equations 5.16 and 5.17. This will yield some expected TDOA values for t12 and
t13 . The correct angle can be found by comparing these two to the actual TDOA
values. This angle will henceforth be denoted θ23, as it has been derived from t23; it is used as the reference angle θ1 below.
While this may be enough in theory, the TDOAs may be inaccurate, and therefore
the angle should be based on all TDOAs instead of only getting the angle from a
single TDOA.
In order to calculate an average angle, angles have to be calculated for both t12 and
t13 . Below is an example showing how it would be done for t12 .
Firstly the equation for t12 is rewritten:
\[
t_{12} = \frac{D}{c}\cos\!\left(\varphi + \frac{\pi}{6}\right)
\;\;\Leftrightarrow\;\;
\arccos\!\left(\frac{c}{D}\, t_{12}\right) - \frac{\pi}{6} = \varphi
\tag{5.20}
\]
To account for there being two solutions to the original equation, it can be found
that another solution is:
\[
\varphi = -\arccos\!\left(\frac{c}{D}\, t_{12}\right) + \frac{11\pi}{6}
\]
or

\[
\varphi = -\arccos\!\left(\frac{c}{D}\, t_{12}\right) - \frac{\pi}{6}
\]

When these angles are found, the one closest to θ1 (the reference angle) is deemed the correct one. This angle distance takes into account that angles lie on a circle, and thus it is possible that the shortest distance crosses the line where the angle wraps from −π to π. This is illustrated in figure 5.11.
Figure 5.11: Example of angle distance being shorter over the line where the angle crosses from −π
to π. As clearly shown, the distance d1 is shorter.
The correct angle gathered from t12 is called θ2 . Using a similar method, θ3 can be
gathered from t13 .
The sound DOA, θ, is then deemed to be the average of the angles θ1, θ2, and θ3, corrected for rotations (such that θ ∈ (−π; π]).
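The derivation above can be condensed into a short numerical routine. The sketch below is illustrative rather than the project's actual code: the spacing D and speed of sound c are example values, and the final averaging is implemented here as a circular mean of the three angle estimates:

```python
# Illustrative sketch of the far-field DOA estimation from Section 5.5,
# given measured TDOAs t12, t13, t23 in seconds. D and C are example values.
import numpy as np

C = 343.0      # speed of sound [m/s]
D = 0.08       # microphone spacing [m]

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return -((-a + np.pi) % (2 * np.pi) - np.pi)

def ang_dist(a, b):
    """Shortest distance between two angles on the circle."""
    return abs(wrap(a - b))

def estimate_doa(t12, t13, t23):
    # Step 1: two candidate angles from t23 (Equation 5.19)
    a1 = np.arcsin(np.clip(C * t23 / D, -1.0, 1.0))
    a2 = np.pi - a1 if a1 >= 0 else -np.pi - a1

    # Step 2: keep the candidate whose predicted t12/t13 (Eqs. 5.16, 5.17)
    # best match the measured values -> reference angle theta1
    def mismatch(phi):
        return (abs(D / C * np.cos(phi + np.pi / 6) - t12)
                + abs(D / C * np.cos(phi - np.pi / 6) - t13))
    theta1 = min((a1, a2), key=mismatch)

    # Step 3: resolve the two arccos solutions for t12 and t13 by choosing
    # the one closest (on the circle) to the reference angle
    base12 = np.arccos(np.clip(C * t12 / D, -1.0, 1.0))
    theta2 = min((wrap(base12 - np.pi / 6), wrap(-base12 - np.pi / 6)),
                 key=lambda p: ang_dist(p, theta1))
    base13 = np.arccos(np.clip(C * t13 / D, -1.0, 1.0))
    theta3 = min((wrap(base13 + np.pi / 6), wrap(-base13 + np.pi / 6)),
                 key=lambda p: ang_dist(p, theta1))

    # Step 4: average the three estimates as unit vectors (circular mean)
    angles = np.array([theta1, theta2, theta3])
    return wrap(np.arctan2(np.sin(angles).mean(), np.cos(angles).mean()))

# Example: TDOAs generated from a true angle of 40 degrees
phi_true = np.deg2rad(40)
t12 = D / C * np.cos(phi_true + np.pi / 6)
t13 = D / C * np.cos(phi_true - np.pi / 6)
print(np.rad2deg(estimate_doa(t12, t13, t13 - t12)))   # approximately 40
```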
Chapter 6
Implementation
This chapter will first describe each hardware component used in the project. It
will then explain each software component along with how they were integrated
into ROS (Robot Operating System).
6.1 Hardware
This section will cover the hardware that will be used in the downscaled prototype.
The prototype will be based on a TurtleBot2, which is an open-source, low-cost, yet
powerful robot kit based on the iClebo Kobuki mobile robot base. The Turtlebot
comes equipped with different useful sensors and libraries.
6.1.1 Onboard Hardware
The TurtleBot2 is a vertically stacked robot based on the iClebo Kobuki mobile robot base. The base features odometry sensors with 52 ticks per encoder revolution, a gyroscope, as well as bumpers, cliff sensors, and wheel drop sensors.[34]
Cliff Sensors
The Kobuki robot base is equipped with three cliff sensors on the underside of
the base. These are used to detect when the robot is approaching a steep dropoff in the operating environment. This enables the robot to detect and act on this
type of environmental obstacle before incapacitating or damaging itself. These are
installed in the front of the robot and on either side[47].
Figure 6.1: The iClebo Kobuki mobile robot base without a tower[47].
Bumpers
Much like the cliff sensors, the Kobuki robot base comes with three bumper sensors installed on the perimeter of the base. These are also used to detect and navigate around obstacles in the operating environment: primarily for collision detection, but they can contextually be used for obstacle avoidance [47].
Encoders
The robot has another very useful component: rotary encoders. Rotary encoders are devices used in a wide variety of applications that require motion monitoring or control. In this case, the encoders are attached to the driving motors of the robot and provide information about the motion of the drive shaft, which is processed into information about the position, speed, and driving distance of the robot base. There exist many different types of encoders, but they all serve the same purpose.
6.1.2 Additional Hardware
Microphones
Determining the DOA using the TDOA method without artificial pinnae requires at least three microphones. The microphones chosen for this project are Behringer ECM8000 measurement microphones; one can be seen in Figure 6.2. These microphones are omnidirectional, meaning they pick up sound equally from every angle, which is ideal for anomaly detection and for finding the DOA from any position in space. They are also condenser microphones, which are well-suited for this project; condenser microphones are explained in Section 2.6.1. They additionally have an ultra-flat frequency response, meaning the microphones capture sound at close to an equal level for all frequencies. The frequency response of a microphone is shown in a graph with the signal's frequency on the x-axis and the signal's dB level on the y-axis; the flatter the graph, the less the microphone will warp or alter the raw recording.
Figure 6.2: The Ultra-linear Behringer ECM8000 Measurement Microphone[7].
3D-printed microphone stand
To ensure precise and repeatable positioning of the microphones in the equilateral
triangle that is required by the TDOA math, a microphone stand was drawn in
Solidworks and 3D-printed using PLA. The microphone stand secures the bottom
of the microphones in place while allowing an XLR cord to be inserted from the
bottom. A sort of lid with holes is then slid on top of the neck of the microphones
and secured to the rest of the stand. The lid holds the necks of the microphones in
place so that they are all precisely 8 cm apart. A foot was also 3D-printed so that it
could be placed on top of the robot without sliding. Two pictures of the stand can
be seen in Figure 6.3.
(a) (L) Microphone stand foot, (C) Lid, (R) Main body. (b) Picture of the microphones inserted into the microphone stand.
Figure 6.3: Pictures of the 3D-printed microphone stand.

Zoom F6 Field Recorder

The Zoom F6 is a portable field recorder designed for professional audio recording in a variety of settings. It features six inputs, each with its own high-quality preamp and individually adjustable gain, making it well-suited for recording interviews, music performances, and sound effects in the field. The F6 has a 32-bit
floating point resolution and a sampling rate of up to 192 kHz, allowing it to capture a wide range of frequencies and dynamics with high accuracy. For this project,
the field recorder has been limited to a sampling rate of 48 kHz, which is a more
than sufficient sampling rate for high-quality audio recordings. It also has built-in
timecode functionality, making it easy to synchronize audio with video recordings.
One of the standout features of the Zoom F6 is its ability to record in both mono
and polyphonic modes. In mono mode, all six inputs are combined into a single
mono track, making it ideal for recording a single source or for mixing multiple
sources together. In polyphonic mode, each input is recorded as a separate track,
allowing for greater flexibility in post-production and use in audio source localization.
Figure 6.4: The Zoom F6 Field Recorder[45].
Logitech C905 720p Webcam
The webcam used for this project is the Logitech C905, which is a compact 2-megapixel portable webcam capable of 720p video feeds at 30 frames per second and able to capture 8-megapixel photos. It uses high-precision Carl Zeiss optics, AutoFocus, and light-correcting RightLight™ 2 technology to improve image quality for optimal video streaming. It is equipped with a built-in microphone, which will not be used in this project. The webcam uses a Hi-Speed USB 2.0 certified connector to connect to a computer.
Figure 6.5: The Logitech C905 720p webcam used for motion detection [1]
6.2 Motion detection program
The motion detection program has been written in Python, partly because of the many widely available open-source libraries and Python's general ease of use. The main library used for the motion detection program is OpenCV, which is one of the largest open-source computer vision libraries and has many useful functionalities that help to streamline the programming process. The source code of the program can be found in the GitHub repository linked in the preface. There exists a variety of methods for doing motion detection, each with its own strengths and drawbacks. "Background Subtraction" was found to be the most suitable candidate for this project, mainly due to its ease of implementation and its capabilities. A brief text-based walkthrough of the code can be found in the last paragraph of Section 5.3. A simplified flowchart of the program can be seen in Figure 6.6.
Figure 6.6: Flowchart of the Motion Detection Program
6.3 Anomaly Detection, TDOA and sound source locating program
As with the motion detection program, the anomaly detection, TDOA, and sound source localization program is also written in Python. The main libraries used in this program are NumPy, to calculate the fast Fourier transforms and the cross-correlation, and sounddevice, to perform the recordings.
The program begins by recording a two-second sound sample from all three microphones. Each microphone is numbered according to the proof from Section 5.5, and the recordings from the microphones are similarly named rec1, rec2, and rec3. After the three simultaneous recordings have been made, an FFT is performed on each sound sample and then compared to a noise template that has been created earlier. How a noise template is made is explained in Section 5.2.2. The program checks, for each recording, whether the FFT contains spikes at frequencies that lie above the noise template. If that is the case, the microphone will have heard an anomaly. If only one or two microphones hear anomalies, the program will start over and record three new samples, but if all three microphones hear anomalies it will instead continue to try and locate the source. To locate the sound source, the program finds the TDOAs between all three microphones using cross-correlation, as explained in Section 5.4.1. It performs cross-correlation between all possible pairings of the recordings. This gives three TDOAs named t12, t23, and t13. From these three TDOAs, a DOA is calculated in accordance with the algorithm laid out in Section 5.5.
A simplified flowchart of the program can be seen in Figure 6.7.
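A condensed sketch of this loop is shown below. The template file name, channel ordering and threshold factor are assumptions, and the localization step is only indicated; it is not the project's actual program:

```python
# Illustrative sketch of the detection loop: record from three channels,
# FFT each channel, compare against a stored noise template and only
# continue to localization when all three channels exceed it.
import numpy as np
import sounddevice as sd

FS = 48_000                 # sampling rate [Hz]
DURATION = 2                # seconds per sample
template = np.load("noise_template.npy")   # max background magnitude per bin
                                            # (built from samples of the same length)

def is_anomaly(signal, factor=2.0):
    """True if any frequency bin exceeds the noise template by `factor`."""
    return np.any(np.abs(np.fft.rfft(signal)) > factor * template)

while True:
    # Record DURATION seconds from three input channels simultaneously
    rec = sd.rec(int(FS * DURATION), samplerate=FS, channels=3)
    sd.wait()
    rec1, rec2, rec3 = rec[:, 0], rec[:, 1], rec[:, 2]

    if all(is_anomaly(r) for r in (rec1, rec2, rec3)):
        # All microphones heard the anomaly: cross-correlate the pairs to
        # get t12, t13, t23 and compute the DOA (see Sections 5.4.1 and 5.5)
        ...
    # otherwise: record three new samples and try again
```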
6.4 ROS
ROS (Robot Operating System) is an open-source framework designed for developing robotic applications [49]. It is a middleware solution that connects software
and hardware in robotic solutions. It has a modular structure consisting of, among
others, topics and nodes, simplifying the process of working with different sets of
pre-built hardware. Additionally, the open-source nature of ROS ensures a variety
of pre-built libraries and packages for commonly used components and robots. It
also has a wide range of support for different operating systems including Linux,
Windows, and macOS, as well as supporting the use of a variety of programming
languages, like C++ and Python. The latest release of ROS is ROS 2 - Humble
Hawksbill. This project, however, uses ROS Noetic Ninjemys for Ubuntu 20.04.
Figure 6.7: Flowchart of the Anomaly detection, TDOA, and sound source locating program
This section intends to describe, in general terms, how ROS works through nodes and topics.
6.4.1 Topics
Topics in ROS essentially work the same way a variable does in a regular program. It holds some information that may be used or changed by some nodes.
This information is stored as a certain type; in ROS, these types are called messages. Messages can be compared to the variable types of a regular programming
language, however, with increased versatility. These can range from a simple Int8,
which includes a single 8-bit integer value, to more complicated messages, like the
Imu message type, which includes information about the robot’s angular orientation, velocity, and acceleration in all three dimensions.
Topics are useful since they work as a middle ground for software to interact with
one another. By utilizing topics, programs no longer have to interact directly with
one another, but can simply manipulate and read the same data.
6.4.2 Nodes
In ROS, nodes are the processes that perform the actual computation of the robot.
They can range from very simple to very complex, depending on the use. There
are two types of nodes: subscriber nodes and publisher nodes. Subscriber nodes
are subscribed to a topic, meaning they perform some operation whenever a topic
changes. Publisher nodes ’publish’ to a topic, meaning they update a topic at a
certain predetermined rate. A node can be subscribed to multiple topics, publish to
multiple topics, and simultaneously subscribe to one topic and publish to another.
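As a minimal illustration of these two node types in ROS Noetic (rospy), the sketch below uses an invented /example/flag topic of type Bool; it is not one of the project's nodes:

```python
# Minimal sketch of a publisher and a subscriber node in ROS Noetic (rospy).
# Topic and node names are illustrative, not the project's actual names.
import rospy
from std_msgs.msg import Bool

def publisher_node():
    """Publishes a Bool message to a topic at a fixed rate."""
    rospy.init_node("example_publisher")
    pub = rospy.Publisher("/example/flag", Bool, queue_size=1)
    rate = rospy.Rate(10)                 # 10 Hz publishing rate
    while not rospy.is_shutdown():
        pub.publish(Bool(data=True))      # message of type Bool
        rate.sleep()

def callback(msg):
    """Runs every time a new message arrives on the subscribed topic."""
    rospy.loginfo("received: %s", msg.data)

def subscriber_node():
    rospy.init_node("example_subscriber")
    rospy.Subscriber("/example/flag", Bool, callback)
    rospy.spin()                          # keep the node alive
```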
6.4.3 Project ROS implementation
This project consists of 3 custom nodes and 3 custom topics. Additionally, it utilizes
the prebuilt packages for the Turtlebot 2, including both nodes and topics. When
running, the turtlebot_bringup/minimalbringup.launch file is called in order for
the program to work.
A graph of the relations between topics and nodes can be seen in figure 6.8.
The motion_detecter node
This node uses the code described in section 6.2. The node works independently
from the rest of the nodes. It reads information directly from the webcam described
in section 6.1.2, and publishes to the /motion_detection/motion_detected topic.
This topic is of the type Bool and is true if there is motion, and false if there is not.
This is implemented in order to have a value that is easy to use in any other node.
Due to the modular nature of ROS, it is possible to run multiple instances of the
motion_detecter node. This means that another camera can easily be added to the setup for increased coverage.
Figure 6.8: Graph of ROS nodes (using the rqt_graph function). This can be found in a bigger version
in appendix A.
The anomaly_detection_and_tdoa node
This node uses the code described in section 6.3. Whenever the node calculates a
goal angle, it publishes to the /localization_topics/goal_angle topic. It does
not publish at a specific rate, as the program is deemed slow enough to not be a
problem. The program is estimated to publish at a rate of approx. 0.4-0.5 Hz. When
the robot is moving, the node does not publish to the /localization_topics/goal_angle
topic, sleeping for 100 ms repeatedly, until the robot has stopped moving.
The /localization_topics/goal_angle topic is of the type Float32 and represents the angle, θ, where θ ∈ (−π; π ].
Running multiple instances of this node is not possible as the code is presented in this report.
The driving_controller node
When an angle has been published to the /localization_topics/goal_angle topic, it is read by the driving_controller node. This node publishes an angular velocity to the /cmd_vel_mux/input/navi topic in order to control the rotation of the Turtlebot. The velocity is controlled using a proportional controller, proportional to the angular distance to the goal angle. This is a fairly simple solution that allows the robot to arrive at its target angle with high accuracy. Since it is practically impossible to arrive at the exact angle desired, the robot has a tolerance of 0.01 angle units, corresponding to 1.8° or approx. 0.0314 rad.
Whenever the robot is moving, it publishes to the /localization_topics/moving topic. This topic is of the type Bool and represents that the robot is in motion. This topic has been implemented since it is simpler than checking all velocity values of the robot.
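A sketch of such a proportional rotation controller is shown below. It reuses the topic names stated above, but assumes the goal angle is a heading in the odometry frame read from the standard /odom topic; the gain, loop rate and tolerance are illustrative and the sketch does not reproduce the project's actual node:

```python
# Illustrative sketch of a proportional rotation controller in the spirit of
# the driving_controller node. Gain, rate and tolerance are example values.
import math
import rospy
from std_msgs.msg import Float32, Bool
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Twist

KP = 1.0                    # proportional gain (illustrative)
TOLERANCE = 0.0314          # rad, roughly 1.8 degrees

class DrivingController:
    def __init__(self):
        rospy.init_node("driving_controller_sketch")
        self.goal = None
        self.yaw = 0.0
        self.cmd_pub = rospy.Publisher("/cmd_vel_mux/input/navi", Twist, queue_size=1)
        self.moving_pub = rospy.Publisher("/localization_topics/moving", Bool, queue_size=1)
        rospy.Subscriber("/localization_topics/goal_angle", Float32, self.goal_cb)
        rospy.Subscriber("/odom", Odometry, self.odom_cb)

    def goal_cb(self, msg):
        self.goal = msg.data

    def odom_cb(self, msg):
        q = msg.pose.pose.orientation       # extract yaw from the quaternion
        self.yaw = math.atan2(2 * (q.w * q.z + q.x * q.y),
                              1 - 2 * (q.y ** 2 + q.z ** 2))

    def spin(self):
        rate = rospy.Rate(20)
        while not rospy.is_shutdown():
            if self.goal is not None:
                # shortest angular distance to the goal, wrapped to (-pi, pi]
                error = math.atan2(math.sin(self.goal - self.yaw),
                                   math.cos(self.goal - self.yaw))
                cmd = Twist()
                if abs(error) > TOLERANCE:
                    cmd.angular.z = KP * error      # velocity proportional to error
                    self.moving_pub.publish(Bool(data=True))
                else:
                    self.goal = None                # target reached, stop rotating
                    self.moving_pub.publish(Bool(data=False))
                self.cmd_pub.publish(cmd)
            rate.sleep()

if __name__ == "__main__":
    DrivingController().spin()
```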
Chapter 7
Verification and Use Case Validation
This chapter aims to document the testing and verification process of the motion
detection, sound anomaly detection, and sound source localization software.
7.1 Resource limitations
The allotted resources for this project limit the prototyping abilities; however, a
concept prototype can be constructed to test the audio and visual localization capabilities of the chosen concepts. This concept prototype will be a Turtlebot 2 robot,
with a webcam and an array of diaphragm condenser microphones. The testing of
the robot will occur in a sound lab at Aalborg University, where the limits of its
abilities will be tested.
7.2 Motion detection program Testing

7.2.1 The test cases
To test the limits of the motion detection program, a series of tests will be performed for the following scenarios:
• Control
• Multiple moving objects
• Non-humanoid-shaped objects
• Partial obstruction of the path
• Decreased/increased camera quality
• Varying distances (1 m, 3 m, 5 m, 10 m, 20 m)
• Varying light levels
• Varying backgrounds
These tests will be performed by moving a human-shaped object across the field of
view of the camera, and recording how many times it correctly detects the object.
If not otherwise specified, the test is performed at a distance of 3 meters and a
brightness of around 190 lux.
7.2.2 Results of motion detection tests
Each of the tests was performed 5 times. The test was executed with small variations.
• Multiple moving objects
This test was designed to test the program’s capacity to detect multiple moving objects in a single scene. The test was done using a person moving their
arms, while another person in the foreground moved from one side of the
frame to another. In all five tests, the program correctly identified the anomalies, with no false positives or false negatives.
• Non-humanoid-shaped objects
This test was designed to test the program’s capacity to detect objects of
different sizes and colors. The test was done using different items of varying
sizes ranging from a box of size 10x10x10 cm to an average-sized human. In
all five tests, the program correctly identified the anomalies, with no false
positives or false negatives.
• Partial obstruction of the path
This test was designed to test the program's capacity to detect objects of different sizes and colors while the path is partially obstructed. The test was done using different items of varying
sizes ranging from a black box 10x10x10 cm to an average-sized human. In
all five tests, the program correctly identified the anomalies, with no false
positives or false negatives.
• Decreased camera quality
This test was designed to test the program’s capacity to detect objects across
different camera qualities. The camera quality was artificially lowered, by
blurring the input image with a Gaussian filter. The tests started with normal camera input, and the amount of blurring was then gradually increased over the four other tests. In
all five tests, the program correctly identified the anomalies, with no false
positives or false negatives.
• Varying distances (1 m, 3 m, 5 m, 10 m, 20 m)
This test was designed to test the program’s capacity to detect objects across
different distances. First one meter and then increasing as the tests go on.
The program succeeded in all test cases at all distances.
• Testing at varying light levels and distances: In this test, the program was
tested at the distances 1 m, 3 m, 5 m, 10 m, and 20 m. Additionally, the
distances 1 m, 3 m, and 5 m were tested at different light levels. These light
levels were:
– Well lit:
Fully lit by either full daylight or multiple light sources.
– Slightly lit:
A single light source is present.
– Poorly lit:
No light sources are lit, but light is present from an adjacent room (behind the camera).
– Dark:
No light source is present.
A table with the distance- and light-level tests can be found in Table 7.1. In this test, the lux level was measured at the point of movement for each distance and light level. Another table, Table 7.2, shows the rest of the test results.
            Well lit          Slightly lit      Poorly lit        Dark
Distance    TP  FP  FN        TP  FP  FN        TP  FP  FN        TP  FP  FN
1 m         15  0   0         15  0   0         15  0   0         0   0   15
  Lux level 30                11                8.9               0
3 m         15  0   0         15  0   0         10  0   5         0   0   15
  Lux level 17                7.8               3.9               0
5 m         15  0   0         15  0   0         0   0   15        0   0   15
  Lux level 13                4.1               0                 0
10 m        5   0   0         -                 -                 -
  Lux level 212               -                 -                 -
20 m        5   0   0         -                 -                 -
  Lux level 181               -                 -                 -

Table 7.1: Table of motion detection at different distances and light levels

Test type                          TP   FP   FN   Precision   Recall
Control                            5    0    0    100%        100%
Multiple objects                   5    0    0    100%        100%
Non-humanoid-shaped objects        5    0    0    100%        100%
Partial obstruction of the path    5    0    0    100%        100%
Decreased camera quality           5    0    0    100%        100%
Varying Backgrounds                5    0    0    100%        100%

Table 7.2: A table of the visual test results

7.2.3 Visual Test Conclusion

From these results, it can be concluded that the visual detection function of the robot solution is not limited by partial obstructions, distances up to 20 meters, or the introduction of multiple and non-humanoid anomalies. As seen in Table 7.2, no cases of either false positives or false negatives have been observed in any well-lit environment. This can be used to conclude that the program can be considered
reliable when used in well-lit environments. This is to be expected, as background subtraction works best in a static, well-lit environment. However, the perfect results might also be a consequence of insufficient stress testing of the program, so the results should be viewed with some skepticism.
However, as seen in Table 7.1, the motion detection performance decreases when
the environment is poorly lit. This is mainly due to the hardware limitations of the
camera, not a limitation of the program. The algorithm can only detect what the
camera outputs and if the camera outputs a black image it is not possible for the
algorithm to detect anything.
Possible solutions to this problem will be explored in chapter 8.
7.3 Audio Testing

7.3.1 Assumptions
Prior to the experiment, a number of assumptions need to be made:
1. The speed of sound is set to 343 m/sec and fluctuations in speed due to
pressure and temperature differences are ignored.
2. Only one sound source is present.
3. The sound is emitted omnidirectionally.
4. The anomaly being detected is on the same horizontal plane as the robot.
Using these assumptions a list of auditory tests has been developed that will be
tested on the system.
• Identify a single sound anomaly in the same room (Soundproof room with
low background noise)
• Identify a single sound anomaly in the same room (Soundproof room with
high background noise)
• Identify a single sound anomaly in the next room (Door open, soundproof
room with low background noise)
• Locating single sound anomaly at a low distance (2 meters) (Finding DOA in
a quiet room)
• Locating single sound anomaly at a medium distance (5 meters) (Finding
DOA in a quiet room)
• Locating single sound anomaly at a high distance (10 meters) (Finding DOA
in a quiet room)
7.3.2 Audio Testing Results

Anomaly Detection
The testing consisted of an anomaly detection portion and a DOA estimation portion. Anomaly detection was performed in a soundproof room, while the DOA estimation was done in a regular quiet room in order to test distances that the smaller soundproof room could not accommodate.
For anomaly detection, each test is done at a different sensitivity level or background noise level. The sensitivity level of the program is the amplitude threshold
that must be exceeded for a certain frequency to count as an anomaly. There are
three sensitivity levels: Low, Medium, and High. The Low sensitivity level is triggered when the amplitude at a certain frequency exceeds 300% of the amplitude
of the same frequency in the noise template, while the Medium sensitivity level is
triggered at 200%, and the High sensitivity level is triggered at 130%. The noise
templates’ noise levels vary from 25 decibels (Low), to 50-55 decibels when simulating a less quiet environment (High).
Each test consisted of 100 samples (samples are 2 seconds each) throughout which a
number of sound anomalies were introduced at random. The number of anomalies
produced and the amount detected by the robot were then compared, producing
the results seen in Table 7.3.
Background Noise Level        Sensitivity   TP   FP   FN   Precision   Recall
Low (25 decibels)             Low           22   0    7    100%        76%
Low (25 decibels)             Medium        19   4    0    83%         100%
Low (25 decibels)             High          29   8    0    78%         100%
High (50-55 decibels)         Low           17   4    2    81%         90%
High (50-55 decibels)         Medium        22   71   0    24%         100%
Low (Next room, open door)    Low           0    0    26   N/A         0%
Low (Next room, open door)    Medium        26   0    0    100%        100%
Low (Next room, open door)    High          24   23   0    51%         100%

Table 7.3: This table shows the results gathered in the anomaly detection testing
During the tests with high background noise, it was quickly established that anything above a low sensitivity would result in almost all samples being classified as anomalies. Therefore, no test with high sensitivity was performed.
Based on the anomaly detection test results, it was concluded that, in order to get the best results, it would be necessary to choose an appropriate sensitivity configuration depending on the environment. If the environment is quiet (25 decibels), the best results would be achieved at a sensitivity of medium or high. Choosing high for this environment will give the occasional false positive, but it also means it will give a very small number of false negatives. This should give the best opportunities to discover an intruder. In a more noisy environment, it is important to choose a low sensitivity to avoid too many false positives. The low sensitivity
does have the disadvantage of occasionally missing an actual anomaly, but picking up on all anomalies in a very noisy environment will be difficult without also detecting false positives.
If the sensitivity is appropriate, the precision of the program is around 80% with a recall close to 100% in an ideally quiet environment. In a noisy environment, the precision is around 80% with a recall of around 90%.
Direction of Arrival
The room chosen as the location for the DOA testing was a large lecture room
at Aalborg University. The size of the room allowed for the testing of the DOA
function at both 2 meters and 5 meters, which was not possible in the soundproof
room, however, it also meant that there was a noticeable echo when an anomaly
was produced. This, along with occasional noise coming from the outside of the
room, should be taken into consideration when analyzing the results of the tests.
With the room chosen, 9 different angles are measured out and marked in the room
with a distance of both 2 and 5 meters. At each of these 18 points, 20 anomalies
are produced, and the direction estimated by the DOA program is noted. Any
results containing errors were ignored and uncounted, as the suspicion was that
they stemmed from either the echo or the external noise.
                    2 meters                                          5 meters
Angle     Average     Minimum          Maximum           Average     Minimum          Maximum
(DEG)     deviation   deviation        deviation         deviation   deviation        deviation
0         6.75°       1°    (1°)       11°   (-11°)      8.45°       1°    (-1°)      20°   (-20°)
30        19.60°      5°    (35°)      21°   (51°)       23.15°      20°   (50°)      34°   (64°)
90        5.00°       5°    (95°)      5°    (95°)       5.95°       5°    (95°)      10°   (100°)
100       6.10°       0°    (100°)     12°   (112°)      12.00°      12°   (112°)     12°   (112°)
160       3.00°       3°    (157°)     3°    (163°)      7.00°       0°    (160°)     30°   (130°)
180       8.80°       8°    (172°)     9°    (-171°)     13.45°      8°    (172°)     18°   (-162°)
-65       9.10°       2°    (-67°)     32°   (-33°)      10.00°      2°    (-67°)     52°   (-13°)
-90       7.20°       6°    (-84°)     22°   (-112°)     20.00°      14°   (-104°)    22°   (-112°)
-130      13.80°      4°    (-134°)    32°   (-162°)     6.00°       6°    (-124°)    6°    (-124°)

Table 7.4: A table showing the recorded angles' deviation from the measured angle. In parentheses is the value of the recorded angle.
As seen in Table 7.4, the microphone array has more difficulty estimating the direction of a sound at some angles than it does at others. This is, for example,
a problem at the direction 30°, where the average deviation is 19.6° at 2 m, which increases to 23.15° at 5 m. These deviations are much higher than those measured at 160°, which are only 3° at 2 meters and 7° at 5 meters. This is also the case for the angle -90°, which has an average deviation of 20° at 5 m.
These deviations seem relatively high; however, when taking into consideration that the webcam being used has a field of view of 65°, deviations below 22.5° still allow the visual detection system to identify the anomaly. This number comes from dividing the FOV by 2, to get the maximum visual deviation in each direction, and subtracting 10° to ensure the movement is within line of sight. Deviations above this limit occurred only 20 times throughout all 360 tests, with 8 at 2 meters and 12 at 5 meters. The anomaly-producing object is thus within the requirement at a success rate of roughly 95.6% at 2 meters and 93.3% at 5 meters.
An increase in the average deviation can also be seen when going from 2 m to 5 m. This is the case for all angles tested except -130°; however, as the other angles point to a clear trend, this single result can be regarded as an outlier. With this information, it can be theorized that at a certain distance, the average deviation may increase to above 22.5°. This would mean that in the majority of cases, the deviation would be too high for the robot to reliably identify the anomaly using the visual detection that follows. With further testing, the maximum acceptable distance could be found.
Difficulties with the solution
A problem that the group ran into while performing these tests was the occurrence of NaN errors. NaN stands for "Not a Number" and became an issue for the detection program because the expected and measured time differences for t12 and t13 did not match. The theory as to why this became a prevalent issue during the auditory localization tests is that noise from external sources can interfere with the cross-correlation and thus the measured time differences. This reasoning explains
why there was an increase in the number of NaN errors under the localization
testing, as these tests needed to be performed in a larger test area without the
anechoic properties of the chamber used for the anomaly detection tests. The
introduction of noise from external sources can make the microphones register a
noise resulting in time differences that do not make sense. This issue had varying
frequency, with no occurrences for tests at some angles and over 60% for tests at other angles. After analyzing the results, no obvious link between the angle and the number of NaN errors can be established without further testing.
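The sketch below illustrates one way such NaN values can arise: under a far-field model, the direction is recovered through an arcsine of a quantity proportional to the measured delay, and noise-corrupted delays can push that argument outside the valid range [-1, 1]. The two-microphone geometry, the spacing, and the function name are assumptions for illustration, not the exact code of the detection program.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.10       # m, assumed spacing between two microphones

def doa_from_tdoa(tau: float) -> float:
    """Far-field direction estimate (degrees) from a single TDOA.

    If noise makes the measured delay larger than the physically possible
    maximum (spacing / speed of sound), the arcsine argument leaves
    [-1, 1] and NumPy returns NaN - the same symptom seen in the tests.
    """
    argument = SPEED_OF_SOUND * tau / MIC_SPACING
    return np.degrees(np.arcsin(argument))   # NaN when |argument| > 1

print(doa_from_tdoa(0.00015))   # plausible delay  -> roughly 31 degrees
print(doa_from_tdoa(0.00040))   # impossible delay -> nan
```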
Chapter 8
Discussion
This chapter first discusses errors and problems encountered during testing, along with ways to improve the current methods. Lastly, it evaluates the tests from chapter 7 against the requirements set in chapter 3.
8.1
Sources of error
Since this project works with sound and vision, there are many possible sources of
error. The discovered sources are listed below. The errors associated with sound
influence the TDOA calculation algorithm and the errors associated with light and
color influence the motion detection algorithm.
8.1.1
Sound-Associated Errors
When operating outside of an anechoic chamber, sound reflection will always be present to some degree. This means that a microphone may register a sound multiple times: once directly from the sound source and again reflected from surfaces in the environment. This may happen for all microphones or only some of them, and it can distort the calculated TDOAs and thus the calculated angle.
Additionally, the sampling rate of the sound card used is limited to 48 kHz. This means that small time differences are harder to resolve, because the measured delays are quantized to whole multiples of the sampling period (roughly 20.8 µs at 48 kHz).
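A back-of-the-envelope sketch of this quantization effect: at 48 kHz the delay can only be measured in steps of one sample, which corresponds to a coarse grid of possible angles under a far-field model. The microphone spacing used below is an assumption for illustration.

```python
import numpy as np

FS = 48_000              # Hz, sampling rate of the sound card
SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.10       # m, assumed microphone spacing

# One sample of delay corresponds to this path-length difference:
sample_period = 1.0 / FS                      # ~20.8 microseconds
path_step = SPEED_OF_SOUND * sample_period    # ~7.1 mm per sample

# The set of angles that an integer sample delay can represent (far field):
max_lag = int(MIC_SPACING / path_step)        # largest whole-sample delay
lags = np.arange(-max_lag, max_lag + 1)
angles = np.degrees(np.arcsin(lags * path_step / MIC_SPACING))
print(angles)   # note how coarse the angular grid becomes near +/-90 degrees
```

The same calculation also shows why a larger microphone spacing helps: with a wider array, one sample of delay corresponds to a smaller change in angle.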
Another problem that has been experienced is background noise. This severely decreases the performance of the program, since it has difficulty differentiating between sound anomalies and background noise.
8.1.2
Light and Color Associated Errors
Since the motion detection is performed using background subtraction with a constant difference threshold, the algorithm is naturally more sensitive when more light is present. This results in the program being worse at noticing movement in darker environments. Simply increasing the sensitivity of the program will make it more susceptible to noise, since noise occurs more frequently when filming dark environments.
Additionally, since background subtraction is used, the robot has difficulty detecting the movement of an object if the background behind it is of a similar color. This is especially true for darker colors, as these also make it harder to see shadows.
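A minimal sketch of this kind of fixed-threshold background subtraction, using OpenCV; the threshold value and blur size are illustrative assumptions rather than the project's actual parameters. Because the threshold is constant, the same physical movement produces smaller pixel differences in a dim scene, which is why detection degrades in the dark.

```python
import cv2

THRESHOLD = 25   # constant difference threshold (illustrative value)

def detect_motion(background_gray, frame_gray):
    """Return bounding boxes of regions that differ from the background."""
    diff = cv2.absdiff(background_gray, frame_gray)
    blurred = cv2.GaussianBlur(diff, (5, 5), 0)          # suppress pixel noise
    _, mask = cv2.threshold(blurred, THRESHOLD, 255, cv2.THRESH_BINARY)
    # OpenCV 4 returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```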
8.2
Areas of Improvement
8.2.1
Areas of Improvement in the Sound Analysis
There are multiple ways to improve the sound source localization. One way is to increase the sampling rate, so that the time differences are quantized less coarsely. The sampling rate cannot be increased indefinitely, as higher rates increase the computational load, and a higher rate alone will not remove the other sources of error.
Another possibility is to increase the distance between the microphones. This increases the TDOAs between the microphones, so the same sampling rate resolves the direction more finely; equivalently, the same angular resolution could be achieved with a lower sampling rate.
Lastly, the method used for sound source localization could be changed. There are many different types of cross-correlation, and only the most basic type was used in this project. It was not very robust to noise and often could not identify the sound source unless the source was very clear; this is elaborated on in Section 7.3.2. To achieve higher noise resistance, more complex cross-correlation methods could be implemented, such as generalized cross-correlation (GCC), which is a step up from the regular cross-correlation (CC). The regular CC method, in simple terms, takes two signals, performs FFTs on both of them so that they are in the frequency domain, finds the sliding dot product between them, and finally performs an inverse FFT on the sliding dot product to transform it back into the time domain. GCC adds an extra step of weighting the frequencies before they are transformed back into the time domain. There are a couple of different weighting functions to choose from: CC, ROTH, SCOT, and PHAT, where PHAT (PHAse Transform) is the only one usable in our case [8].
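A minimal sketch of GCC-PHAT, as a possible replacement for the plain cross-correlation: the cross-spectrum is normalized by its magnitude (the PHAT weighting) before the inverse FFT, which whitens the spectrum and sharpens the correlation peak [8]. This is an illustrative implementation under those assumptions, not the code used in the project.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` using GCC-PHAT.

    Returns the delay in seconds. `max_tau` can bound the search to
    physically possible delays (microphone spacing / speed of sound).
    """
    n = len(sig) + len(ref)                      # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross_spectrum = SIG * np.conj(REF)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross_spectrum /= np.abs(cross_spectrum) + 1e-15
    cc = np.fft.irfft(cross_spectrum, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so that zero lag sits in the middle of the searched window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```

In principle, the same function could be applied to each microphone pair to obtain t12 and t13 before the direction estimate, replacing the plain cross-correlation step.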
Another problem is the system's current inability to detect sound anomalies while driving, since the sound produced by the robot's own movement masks them. As a result, the robot is essentially unable to detect anything while moving.
8.2.2
Areas of Improvement in Motion Detection
The main area of improvement in the motion detection algorithm is the ability of the program to detect motion in darker environments. One possible way to do this could be to dynamically calibrate the sensitivity of the program depending on the light level, as sketched after this paragraph; this has, however, not been properly experimented with. Another approach could be to use a type of camera that is less impaired in low-light conditions. One example would be an infrared camera, which detects infrared light instead. It would, however, not be able to detect non-living agents (i.e. objects with a surface temperature similar to that of the environment). Another example of a camera more suited for low-light conditions is a depth-sensing camera, which would be capable of 3D-mapping the environment and would thus be better at discovering changes despite poor light conditions. Using such cameras would, however, come at an increased price compared to an RGB camera.
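One way the dynamic calibration mentioned above could look is to scale the difference threshold with the mean brightness of the frame, so that a dim scene uses a lower threshold. The function, its parameters, and the scaling constants below are illustrative assumptions and would need tuning.

```python
import numpy as np

def adaptive_threshold(frame_gray, base_threshold=25,
                       reference_brightness=128, minimum=8):
    """Scale the background-subtraction threshold with scene brightness.

    A frame darker than the reference lowers the threshold proportionally,
    so small pixel differences in dim scenes can still trigger detection;
    `minimum` keeps the threshold from collapsing into the noise floor.
    """
    brightness = float(np.mean(frame_gray))
    scaled = base_threshold * brightness / reference_brightness
    return max(minimum, int(round(scaled)))
```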
Another area of improvement is the system's current inability to track motion while the robot itself is moving. A way to improve this would be to include some movement compensation when running the motion detection program.
8.2.3
Miscellaneous Areas of Improvement
Another area of improvement is the movement controller. Currently, it is a simple proportional controller. This simplistic approach was chosen because the movement was not deemed an important part of this project, but it remains a clear area of improvement. One way of improving the movement controller would be to implement a full PID controller instead of just a P controller.
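A minimal discrete PID sketch of the kind of controller proposed here; the class, the gains, and the example error value are placeholders and would have to be tuned for the robot base.

```python
class PID:
    """Simple discrete PID controller (the current robot only uses the P term)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = 0.0

    def update(self, error, dt):
        """Return the control output for the current error and time step (dt > 0)."""
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: turning the robot toward an estimated sound direction.
controller = PID(kp=0.8, ki=0.05, kd=0.1)            # placeholder gains
angular_velocity = controller.update(error=15.0, dt=0.05)
```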
8.3
Requirement Fulfillment
To properly assess the capabilities of the prototype, the results of the multiple
rounds of testing can be compared to the requirements formulated at the start of
the project, seen in Table 4.1.
Requirement 3.1 sets a goal of 90% precision for the anomaly detection aspect of
the robot and Requirement 3.2 sets a goal of 90% recall for the same system.
• Using the most optimal noise template for a low background noise level (medium), a precision of 83% and a recall of 100% were achieved over 100 samples. Only requirement 3.2 was reached, although the result came close to requirement 3.1.
• Using the most optimal noise template for a high background noise level (low), a precision of 81% and a recall of 90% were achieved. Only requirement 3.2 was reached, although the result came close to requirement 3.1.
• Using the most optimal noise template for the next room with a low background noise level (medium), a precision of 100% and a recall of 100% were achieved. Requirements 3.1 and 3.2 were reached.
From this it can be gathered that the sound anomaly detection aspect does not meet the initial requirements, except when detecting sound in a different room with low background noise.
Requirement 3.3 sets the maximum acceptable deviation to 22.5° and expects a success rate of 90%.
• At a distance of 2 meters the success rate was 95.6%, meaning it reached
requirement 3.3.
• At a distance of 5 meters the success rate was 93.3%, meaning it reached
requirement 3.3.
From this, it can be gathered that the DOA aspect of the project has met the initially set requirement.
Requirement 3.4, sets a goal of using visual detection to identify a human
anomaly in a well-lit environment 95% of the time.
• At a distance of 1 meter, the tests yielded a 100% precision and 100% recall.
• At a distance of 3 meters, the tests yielded a 100% precision and 100% recall.
• At a distance of 5 meters, the tests yielded a 100% precision and 100% recall.
• At a distance of 10 meters, the tests yielded a 100% precision and 100% recall.
• At a distance of 20 meters, the tests yielded a 100% precision and 100% recall.
As 100% of tests in a well-lit environment were successfully able to identify the
anomaly presented, this requirement is determined to be fulfilled.
The final requirement, Requirement 3.5, sets a goal of using visual detection to
identify a human anomaly in a poorly-lit environment 80% of the time.
• At a distance of 1 meter, the tests yielded a 100% precision and 100% recall.
• At a distance of 3 meters, the tests yielded a 66.67% precision and 100% recall.
• At a distance of 5 meters, the tests yielded a 0% precision. The recall could
not be defined.
From this, it can be concluded that the visual detection algorithm fulfills the requirement at a distance of 1 meter, but not at distances of 3 meters and above. Additionally, the fact that only the precision is below expectations may indicate that the program is too insensitive.
Chapter 9
Conclusion
The aim of this project was to create a perception system to detect humans for a mobile robot platform. The project had the specific problem statement: "How can a perception system be made to detect humans for a mobile robotic platform?". From this problem statement, a concept design was made, and using these two as a stepping stone, a number of requirements were set, which can be found in Table 4.1. It can be concluded that the product documented in this report has fulfilled the requirements to the extent shown in Table 9.1.
No.  | System Requirement                              | Status
3.1  | Precision of Sound Anomaly Detection            | Partly fulfilled*
3.2  | Recall of Sound Anomaly Detection               | Fulfilled
3.3  | Sound Anomaly Identification                    | Fulfilled
3.4  | Human Identification in well-lit environment    | Fulfilled
3.5  | Human Identification in poorly-lit environment  | Partly fulfilled*

Table 9.1: Table showing the fulfillment status of the initially set system requirements.
*Partly fulfilled is to be understood as: the requirement has been fulfilled under certain, but not all, conditions.
It can be seen that three of the initial requirements have been fulfilled, and the other two are fulfilled under certain conditions. Specifically, it has been shown that the system made in this report is capable of detecting motion in a well-lit environment and is mostly capable of detecting sound anomalies in both quiet and noisy environments. It has also been shown that the system is mostly incapable of detecting motion in a poorly lit environment, only being capable of detecting motion at a distance of 1 meter from the target.
Additionally, possibilities for further work have been discussed. These include:
1. Finding a more accurate way to determine the DOA of a sound anomaly possibly using a different method such as generalized cross-correlation.
2. Finding a way for the motion detection system to work in non-well-lit environments - possibly using a different type of camera.
3. Finding a way for the motion detection system and the anomaly detection
system to function despite the movement of the mobile robot base, and the
sound produced by the motors.
4. Implementing a better movement controller, in order for the robot to move in
a smoother manner.
With this further work, it is assumed that the project could reach a higher level of
accuracy when detecting anomalies, while also being generally more functional.
Bibliography
[1] url: https://www.amazon.com/Logitech-960-000045-720p-Webcam-C905/dp/B000RZNI4S.
[2] 3Blue1Brown. But what is the Fourier Transform? A visual introduction.
[3] About nimbo. url: http://hellonimbo.com/about/.
[4] S. Adrián-Martínez et al. Acoustic signal detection through the cross-correlation method in experiments with different signal to noise ratio and reverberation conditions. 2015. doi: 10.48550/ARXIV.1502.05038. url: https://arxiv.org/abs/1502.05038.
[5] Edge AI + Vision Alliance. Camera Selection – How Can I Find the Right Camera for My Image Processing System? url: https://www.edge-ai-vision.com/2019/03/camera-selection-how-can-i-find-the-right-camera-for-my-image-processing-system/.
[6] nti audio. Fast Fourier Transformation FFT - Basics.
[7] Behringer ECM8000 målings mikrofon.
[8] Lin Chen et al. "Acoustic Source Localization Based on Generalized Cross-correlation Time-delay Estimation". In: Procedia Engineering 15 (2011). CEIS 2011, pp. 4912–4919. issn: 1877-7058. doi: 10.1016/j.proeng.2011.08.915. url: https://www.sciencedirect.com/science/article/pii/S1877705811024167.
[9] Prof. Efstathiou Constantinos.
[10] Donatello Conte et al. "An Ensemble of Rejecting Classifiers for Anomaly Detection of Audio Events". In: 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance. 2012, pp. 76–81. doi: 10.1109/AVSS.2012.9.
[11] Cyrus Farivar. Security robots expand across U.S., with few tangible results. 2021. url: https://www.nbcnews.com/business/business-news/security-robots-expand-across-u-s-few-tangible-results-n1272421.
[12] Felix Asche, Basler AG. Camera selection for low-light imaging. 2021. url: https://www.photonics.com/Articles/Camera_Selection_for_Low-Light_Imaging/a66942.
[13] Alison Fields, Steven Linnville, and Robert Hoyt. "Correlation of objectively measured light exposure and serum vitamin D in men aged over 60 years". In: Health Psychology Open 3 (May 2016). doi: 10.1177/2055102916648679.
[14] Emilie Foxil. Så meget videoovervågning er der i Danmark. 2019. url: https://nyheder.tv2.dk/politik/2019-10-09-saa-meget-videoovervaagning-er-der-i-danmark.
[15] Global Patrol Robot Market Size and forecast. url: https://www.marketresearchintellect.com/product/global-patrol-robot-market-size-and-forecast/?utm_source=Designerwomen&utm_medium=127.
[16] Aman Preet Gulati. Vehicle motion detection using background subtraction. 2022. url: https://www.analyticsvidhya.com/blog/2022/03/vehicle-motion-detection-using-background-subtraction/.
[17] Yi Guo and Zhihua Qu. "Coverage control for a mobile robot patrolling a dynamic and uncertain environment". In: Fifth World Congress on Intelligent Control and Automation (IEEE Cat. No.04EX788). Vol. 6. 2004, pp. 4899–4903. doi: 10.1109/WCICA.2004.1343643.
[18] Hugh D. Young and Roger A. Freedman. University Physics - With Modern Physics. Fifteenth edition with SI units. Pearson, 2020, pp. 1213–1221.
[19] Stemmer Imaging. Colour Cameras. url: https://www.stemmer-imaging.com/en/knowledge-base/colour-cameras/.
[20] Indbrud I Fire Ud af ti bygge- og anlægsvirksomheder. 2019. url: https://via.ritzau.dk/pressemeddelelse/indbrud-i-fire-ud-af-ti-bygge--og-anlaegsvirksomheder?publisherId=12604233&releaseId=13575848.
[21] Isabell Bang Christensen and Iben Peders. Antallet af anmeldte indbrud falder fortsat. url: https://www.dst.dk/da/Statistik/nyheder-analyser-publ/nyt/NytHtml?cid=33206.
[22] Ui-Hyun Kim, Kazuhiro Nakadai, and Hiroshi G. Okuno. Improved sound source localization in horizontal plane for binaural robot audition - Applied Intelligence. 2014. url: https://link.springer.com/article/10.1007/s10489-014-0544-y.
[23] Koshsh. 2021. url: https://smpsecurityrobot.com/products/robot-thermal-camera/.
[24] Kristian Nymoen, University of Oslo. Quantitative Sound Analysis and the Visual Representations of Sound.
[25] Mike Levine. The shape of things to come: Different types of microphones and when to use them. 2022. url: https://www.popsci.com/reviews/types-of-microphones/.
[26] Song Li and Jürgen Peissig. "Measurement of Head-Related Transfer Functions: A Review". In: Applied Sciences 10.14 (2020). issn: 2076-3417. doi: 10.3390/app10145014. url: https://www.mdpi.com/2076-3417/10/14/5014.
[27] Wenqi Li, Dehua Chen, and Jiajin Le. "Robot Patrol Path Planning Based on Combined Deep Reinforcement Learning". In: 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). 2018, pp. 659–666. doi: 10.1109/BDCloud.2018.00101.
[28] Hanhe Lin et al. "Online weighted clustering for real-time abnormal event detection in video surveillance". In: Proceedings of the 24th ACM International Conference on Multimedia. 2016, pp. 536–540.
[29] Hong Liu and Miao Shen. "Continuous sound source localization based on microphone array for mobile robots". In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2010, pp. 4332–4339. doi: 10.1109/IROS.2010.5650170.
[30] Thomas B. Moeslund. Introduction to Video and Image Processing: Building Real Systems and Applications. Springer London, 2012.
[31] Geoffrey Morrison. Fisheye Or Wide-Angle Lens For Travel. 2022.
[32] Amna Rahman, Zakria Qadir, Abbas Z. Kouzani, Muhammad Usman Liaquat, Hafiz Suliman Munawar, and M. A. Parvez Mahmud. "Sound Localization for Ad-Hoc Microphone Arrays". In: Energies (2021). url: https://www.mdpi.com/1996-1073/14/12/3446/pdf.
[33] Guilherme Gaigher Netto. Optical flow and motion detection. 2019. url: https://medium.com/@ggaighernt/optical-flow-and-motion-detection-5154c6ba4419.
[34] Open Source Robotics Foundation, Inc. TurtleBot2. url: https://www.turtlebot.com/turtlebot2/.
[35] Ata-Ur Rehman et al. "Multi-Modal Anomaly Detection by Using Audio and Visual Cues". In: IEEE Access 9 (2021), pp. 30587–30603. doi: 10.1109/ACCESS.2021.3059519.
[36] Ashutosh Saxena and Andrew Y. Ng. "Learning Sound Location from a Single Microphone". In: (2009). url: https://cs.stanford.edu/people/asaxena/monaural/monaural.pdf.
[37] Ashutosh Saxena and Andrew Y. Ng. Learning Sound Location from a Single Microphone - Stanford University. 2009. url: https://cs.stanford.edu/people/asaxena/monaural/monaural.pdf.
[38] Kamal Sehairi, Fatima Chouireb, and Jean Meunier. "Comparative study of motion detection methods for video surveillance systems". In: Journal of Electronic Imaging 26.2 (2017), p. 023025. doi: 10.1117/1.jei.26.2.023025. url: https://doi.org/10.1117%2F1.jei.26.2.023025.
[39] Serviceforbundet: Peter Jørgensen, DI overenskomst: Annette Fæster Petersen, Vagt- og Sikkerhedsfunktionærernes Landssammenslutning: Robet F. Andersen. DI - Danmarks største arbejdsgiver- og erhvervsorganisation - dansk ... 2020. url: https://www.danskindustri.dk/DownloadDocument?id=161749&docid=64162.
[40] Sound Fields: Free versus Diffuse Field, Near versus Far Field. 2020. url: https://community.sw.siemens.com/s/article/sound-fields-free-versus-diffuse-field-near-versus-far-field.
[41] Danmarks Statistik. Indbrud i forretning, virksomhed mv. Q1-2022 Q2-2022 [Hele Landet]. url: https://www.statistikbanken.dk/straf11.
[42] Lasse Nikolaj Staun. Så mange indbrud bliver der begået i Danmark om året. 2020. url: https://dkr.dk/indbrud/indbrud-i-tal.
[43] Stacy Stephens. K5. 2021. url: https://www.knightscope.com/k5/.
[44] Adobe Stock. 360 Bilder – Bläddra Bland 117,484 Stockfoton, Vektorer Och Videor. 2022.
[45] T-Studio. ZOOM F6 Field Recorder.
[46] Temporal Representations: Time-Measuring Circuits. url: https://doctorlib.info/physiology/medical/90.html.
[47] Kobuki TurtleBot. iClebo Kobuki Robot Base. url: http://kobuki.yujinrobot.com/about2/.
[48] Vocal Technologies. url: https://vocal.com/echo-cancellation/spatial-sampling-and-aliasing-with-microphone-array/.
[49] Why ROS? 2022. url: https://www.ros.org/blog/why-ros/.
Appendix A
RQT_graph