Audio-Visual localization of
Humans for Robotic Patrolling of
Indoor Environments
- Aalborg University -
Project Report
ROB3_gr01
Aalborg University
Electronics and IT
Copyright © Aalborg University 2015
Electronics and IT
Aalborg University
http://www.aau.dk
Title:
Audio Visual Localisation
Theme:
Automatic Sensing of the Environment
Project Period:
Fall Semester 2022
Project Group:
ROB3_gr01
Participant(s):
Jonathan Rod Skarregaard
Silas Porsgaard Steensgaard
Christoffer Thomas Ulf Koch Andersen
Hans Henrik Dalgaard
Peter Plass Jensen
Supervisor(s):
Jesper Rindom Jensen
Copies: 1
Page Numbers: 80
Abstract:
According to Danmarks Statistik, 3,849 burglaries were committed at company and business properties in the first six months of 2022. This report explores the possibility of developing a mobile robot, equipped with video and audio sensors, to patrol large business properties in order to decrease the workload of security forces while maintaining a high degree of safety. The robot uses audio source localization to detect sound anomalies while patrolling, and motion detection algorithms on the camera feed serve as an early warning system for possible intrusions. The audio source localization uses a three-microphone array and cross-correlation to determine interaural time differences, which allow the sound source direction to be estimated. The motion detection algorithm uses a live video feed and performs background subtraction on the images to detect and draw bounding boxes around objects in motion.
Date of Completion:
April 30, 2023
The content of this report is freely available, but publication (with reference) may only be pursued in agreement with the author.
Contents

Preface

1 Introduction
2 Problem Analysis
  2.1 Target Demographic
  2.2 Break-ins
  2.3 Environmental Challenges
    2.3.1 Lighting Conditions
    2.3.2 Sound Conditions
    2.3.3 On Sound Diffraction
    2.3.4 Using Multiple Sensors
  2.4 Alternative Solutions to assist Security Guards
    2.4.1 A Camera-microphone solution
    2.4.2 A Mobile Robot Solution
  2.5 State of the art
    2.5.1 Commercial products
    2.5.2 Research
    2.5.3 Path Planning using Reinforcement Learning and Neural Networks
    2.5.4 State of the Art Conclusion
  2.6 Sensor Possibilities
    2.6.1 Audio Sensors
    2.6.2 Visual Sensors
  2.7 Subconclusion & Problem Formulation
3 Requirements
4 Design Concept
  4.1 Camera Setups
  4.2 Sound localization
  4.3 On the use of multiple sensors
  4.4 Concept Selection
    4.4.1 Microphone Selection
    4.4.2 Functional Requirements
5 Methodology
  5.1 Anomaly Detection
  5.2 Audio Analysis
    5.2.1 Fourier Transform in Audio Analysis
    5.2.2 Using The Fourier Transform To Detect Sound Anomalies
    5.2.3 Nyquist-Shannon Sampling Theorem
    5.2.4 Aliasing
    5.2.5 Spatial Aliasing
    5.2.6 Sampling Rate
  5.3 Motion Detection
  5.4 Auditory localization
    5.4.1 Determining TDOA
  5.5 Farfield DOA proof
6 Implementation
  6.1 Hardware
    6.1.1 Onboard Hardware
    6.1.2 Additional Hardware
  6.2 Motion detection program
  6.3 Anomaly Detection, TDOA and sound source locating program
  6.4 ROS
    6.4.1 Topics
    6.4.2 Nodes
    6.4.3 Project ROS implementation
7 Verification and Use Case Validation
  7.1 Resource limitations
  7.2 Motion detection program Testing
    7.2.1 The test cases
    7.2.2 Results of motion detection tests
    7.2.3 Visual Test Conclusion
  7.3 Audio Testing
    7.3.1 Assumptions
    7.3.2 Audio Testing Results
8 Discussion
  8.1 Sources of error
    8.1.1 Sound-Associated Errors
    8.1.2 Light and Color Associated Errors
  8.2 Areas of Improvement
    8.2.1 Areas of Improvement in the Sound Analysis
    8.2.2 Areas of Improvement in Motion Detection
    8.2.3 Miscellaneous Areas of Improvement
  8.3 Requirement Fulfillment
9 Conclusion
Bibliography
A RQT_graph
Preface
Aalborg University, April 30, 2023
This project has been written by gr_01 of the 3rd semester of the Robotics bachelor
at Aalborg University. The project was written over a 4-month period, from the
beginning of September until the end of December. The report discusses the use
of audio-visual localization in security robotics for patrolling large business properties. We would like to extend our gratitude to our supervisor for the guidance
he has given us.
The source code associated with this project is publicly available in the master branch of a Git repository at the following link:
https://github.com/hh4000/p3_project
Hans Henrik Dalgaard
Peter Plass Jensen
<hdalga21@student.aau.dk>
<ppje21@student.aau.dk>
Silas Porsgaard Stensgaard
Christoffer Thomas Ulf K. Andersen
<spst21@student.aau.dk>
<ctuk21@student.aau.dk>
Jonathan Rod Skarregaard
<jskarr21@student.aau.dk>
Chapter 1
Introduction
It is in the best interest of the majority of businesses and companies to keep their
properties and assets safe from burglaries and vandalism. Because of this, many
companies have taken measures to ensure this safety by installing security cameras,
motion detectors, and alarms on their properties and in their buildings. These measures give law enforcement the ability to react swiftly to security breaches with accurate descriptions of the intruders. Unfortunately, law enforcement entities do not
always have the necessary resources to respond to burglaries in time, and because
of this, many large businesses have chosen to employ private security companies
or guards to patrol and keep their businesses safe. Private security companies and guards add considerable expense for their customers, because security guards often work high-wage night shifts, and one or more guards are often required to patrol the premises of a business.
According to Danmarks Statistik, there have been 3849 burglaries of companies and
businesses in the first six months of 2022 alone, which corresponds to 21 burglaries
per day [41]. These burglaries can impose significant financial losses on businesses and companies through loss of productivity and loss of assets. These losses may seem negligible to large companies, but it is still in the best interest of a company to prevent as many as possible. Because of this, this report explores the possibility of using robotics to help reduce the workload and expenditure of private security measures for companies while maintaining a high degree of security.
Chapter 2
Problem Analysis
Property security is an ever-present problem for private homes, commercial enterprises, and industrial complexes. As technology has progressed, many more ways of implementing security measures to prevent and detect intrusion have emerged. These include, but are not limited to, security cameras, guards, and motion sensors. Another way to increase security could be to implement a patrolling security robot capable of detecting intruders.
This problem analysis aims to cover all relevant aspects of the above-mentioned
patrolling security robot. Firstly, the target demographic is analyzed in Section 2.1.
After this, the environmental challenges of the expected environment and potential
ways to circumvent them are tackled in Section 2.3. This is followed by a market
analysis exploring the state of the art, regarding security robotics, in Section 2.5.
Lastly, a final problem formulation is formed in Section 2.7.
2.1 Target Demographic
The required complexity of the robot means that its price is likely to make it a less attractive option for smaller companies, as these companies are less likely to have the expensive inventory and equipment needed to make the robot financially viable. The robot may also be viable for companies that may not
have high-value physical objects, but instead have confidential information that
would be detrimental in the hands of the wrong people, such as data centers. The
robot in itself may not be enough to prevent theft; it has no functions for apprehending or stopping intruders beyond alerting on-site security or the police, and, since it can only discover intruders, it mainly functions as a deterrent.
This means that it functions optimally when implemented in combination with
on-site security. Additionally, for the robot to be worthwhile, stationary security cameras would need to be a less attractive option for the surveillance area, whether due to the cost of implementing full-coverage camera surveillance or because security is only needed short term. If this is not the case, this solution would likely not make sense for the company, as even with an optimal implementation, 24/7 complete camera surveillance would be more effective. This leaves the following businesses as some of the potential consumers
of the product:
• Gated communities
• University campuses
• Critical infrastructure (such as electrical substations and water sources)
• Apartment complexes and offices
• Factories and warehouses
There may be some companies that decide to implement the robot to supplement
current security measures with a physical deterrent. This type of implementation
has been attempted in places such as Liberty Village in Las Vegas, USA. While there
is no definitive proof that this solution decreased crime rates in the apartment complex, the area was removed from the Las Vegas Metropolitan Police Department’s
list of the top 10 areas with the most frequent 911 calls after the launch of a Knightscope
patrol robot in the area. Therefore, the physical presence of a robot may be an
effective deterrent [11].
2.2 Break-ins
Burglaries in Denmark have been declining for many years, but, as mentioned in chapter 1, there were thousands of company burglaries in Denmark
in the first 6 months of 2022 alone [42]. So, while burglaries are on a downward
trend, companies still have an interest in keeping security high as the chance of a
burglary is never zero. A graph of the burglaries in Denmark can be seen in Figure
2.1.
Figure 2.1: A graph from Danmarks Statistik showing the number of burglaries in residential buildings in blue, businesses in orange, and uninhabited residential buildings in green [21].

Construction sites and construction workers' vans often fall victim to thefts at night because of the valuable tools they contain. When a theft happens, the workers are usually out of commission for a day or two while they report the incident to the police or are transferred to a different project. Thefts at construction sites and from vans amount, on average, to 2,286 €, not including wasted company time [20].
Most companies, especially larger companies, currently have security cameras and
alarms to help prevent incidents, but these systems are stationary and can be tampered with, and implementing a full security force is expensive [14]. Additionally,
some companies (e.g. factories) have vast properties that are nearly impossible
to patrol sufficiently with security forces within a realistic budget. This makes a
robotic patrol unit more attractive. Many large companies with high-value products or information do however still employ security firms and security guards for
either increased security, deterrence, or faster reaction time to intrusions.
Security guards have many responsibilities, including patrolling, standing guard, and apprehending intruders, some of which can be done by an autonomous mobile robot. The average hourly wage of a security guard is around 22 €/h, depending on experience. An extra 3.9 €/h is paid for shifts extending into the hours between 17:00 and 06:00. If a company needs a security guard on its premises between 22:00 and 06:00 every day, this totals 56 hours per week. Assuming the guard earns an average of 25.9 € per hour, this amounts to approximately 75,400 € annually (not including bonuses for working during weekends or holidays)
[39]. The implementation of a robotic patrolling unit cannot fully replace an entire
security team, but it can replace parts of the job such as patrolling and standing
guard. Therefore the company can cover larger areas with only a few guards for
the verification and apprehension of the intruder.
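For reference, the annual figure above follows directly from the stated numbers: 56 h/week × 52 weeks/year × 25.9 €/h ≈ 75,400 €/year.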
2.3 Environmental Challenges
Since systems using audio or visual localization are entirely reliant on their sensor
inputs, it is crucial to ensure appropriate environmental conditions for optimal
operation. A perceptive system that is flooded with noise is essentially blind, or at least impaired, since its sensor inputs no longer carry any distinguishable information. This section will cover the consequences
of poor conditions and some ways to circumvent these audio-visual challenges.
2.3.1 Lighting Conditions
Systems that utilize computer vision, image processing, and the like are dependent
on appropriate lighting conditions in order for the onboard cameras to capture
useful images of the environment. This is especially relevant for systems that will
operate indoors, at night, or both. If the lighting conditions are poor, the system
could compensate by using pre-processing on the image to increase the brightness
and/or contrast. However, this also amplifies the visual noise, making it more difficult to gather correct information from the image, although the processed image might still be of value.
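As an illustration of such pre-processing, the sketch below brightens a dark frame with a simple gain and offset. It assumes OpenCV is used for image handling; the gain, offset, and file names are illustrative values and not part of the actual system.

import cv2

def brighten(frame, gain=1.8, offset=40):
    # Apply new_pixel = gain * pixel + offset, clipped to the valid 0-255 range.
    return cv2.convertScaleAbs(frame, alpha=gain, beta=offset)

frame = cv2.imread("dark_hallway.jpg")   # hypothetical input image
if frame is not None:
    cv2.imwrite("dark_hallway_brightened.jpg", brighten(frame))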
Figure 2.2 lists the estimated brightness of different environments.
The brightness of the environment is measured in lux, also known as lumens per
square meter. Some of the important values here are those of the "Hallway" and the
"Office lighting", as these are the environments the robot will mostly be surveying.
The hallway, with an estimated brightness of 80 lux, will be our benchmark for a well-lit environment; all environments with a lux level of over 80 will be referred to as "well-lit". Environments with a lux level below 80 will be referred to as "poorly lit" from this point on. "Office lighting" and "Full daylight" will mostly apply when
performing surveillance during the day, as most offices’ artificial lighting is limited
throughout the nighttime. These values lie well within the requirement of a well-lit
environment. It is important to note that "Office lighting" and "Hallway" are not
the only light levels covered by the target demographic, but those are the most
prevalent; other non-mentioned lighting conditions are "Overcast day", "Very dark
overcast day", and "Minimal street lighting", among others.
Figure 2.2: A table showing the estimated brightness of environments in lux[13].
The issue of capturing useful images can be tackled by either ensuring proper
lighting conditions in the operational environment, or by equipping the system
with an array of different cameras that are useful in different scenarios and under different lighting conditions. If a patrolling security robot is examined as an
example, it is safe to assume that the robot will be looking for either intruders or
signs of intrusion. This means that an infrared camera could be used to record
heat signatures when lighting conditions do not allow for a clear identification of
the intruder. Another method of assisting navigation and identifying intruders in
poor lighting conditions is using depth sensors to detect movement. This sensor
is also useful when characterizing the topography of the environment, as well as
detecting obstacles. However, these additional sensors naturally drive up the total
production cost of each unit. Some of these sensors will be further elaborated upon in section 2.6.2.
2.3.2 Sound Conditions
Much like systems that utilize vision, systems that depend on audio input need
appropriate conditions for optimal operation. An audio-perception system in a
noisy facility would not work as optimally as in a quiet office building since the
incoming audio anomalies are much easier to distinguish in a quieter environment.
It is safe to assume that the system will be looking for anomalies such as windows breaking, doors opening and closing, and unexpected sounds of movement.
If the environment is especially noisy, the sounds that the system would like to
detect are easily drowned out by background noise.
When using sound sensors, there are generally two ways to utilize them. One way
is to use sonar mapping, i.e., actively emitting sound at a certain pitch and using the return times to map out the environment. Another way is to passively analyze the sound from the environment to calculate the direction of the sounds of interest.
2.3.3 On Sound Diffraction
Before examining audio sensors, this section will give an overview of the properties of sound. This will give a broader understanding of why one might use sound
sensors.
It is a well-known fact that sound is capable of bending around corners. This is
due to the wave property of diffraction, which occurs for all types of waves [18].
As an example, single-slit diffraction will be examined below.
When speaking of single-slit diffraction, it is often light that is taken into consideration. In this case, equation 2.1 below is often used to determine the position of the dark fringes [18].
\[ \sin(\theta) = \frac{m\lambda}{a}, \qquad (m = \pm 1, \pm 2, \pm 3, \ldots) \tag{2.1} \]
In equation 2.1, θ is the angle from the slit to the center of the m’th dark fringe on
the screen, λ is the wavelength, and a is the slit width.
In addition to this, the intensity at different angles can be calculated as seen below
(equation 2.2) [18].
\[ I = I_0 \left( \frac{\sin\bigl(\pi a \sin(\theta)/\lambda\bigr)}{\pi a \sin(\theta)/\lambda} \right)^{2} \tag{2.2} \]
In equation 2.2, I is the intensity at a given angle θ and I_0 is the intensity at θ = 0.
The application of equation 2.2 at different ratios of λ/a is explored in figure 2.3.
In figure 2.3 it can be seen that all functions have a central peak around θ =
0. This peak will henceforth be referred to as the central intensity maximum.
Additionally, it can be seen that, as λ/a increases, the central intensity maximum becomes wider.
One thing that is apparent from equation 2.1 is that it is only usable when the ratio λ/a is less than or equal to 1. If λ/a > 1, there is no solution to the equation. This does, however, raise the question of what happens when the wavelength λ is larger than the slit width a. From equation 2.2 it can be seen that the central intensity maximum would extend beyond 180° [18]. This is relevant since most
sounds have a large wavelength. The sound waves of human speech generally
have wavelengths of 1 m or greater[18]. Assuming that a doorway is 1 m or less
wide, this would be a case of the wavelength being longer than the width of the slit
(the width of the doorway). In this case, the sound could easily travel through the
doorway and spread into whatever room the doorway connects to, even without
taking sound reflection into account.
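As a small numerical check of equation 2.2, the sketch below evaluates the normalised intensity I/I0 at a few angles for the same a/λ ratios that are plotted in Figure 2.3. It is a minimal illustration using numpy; the chosen angles are arbitrary.

import numpy as np

def relative_intensity(theta_deg, a_over_lambda):
    # Equation 2.2: I/I0 = [sin(pi*a*sin(theta)/lambda) / (pi*a*sin(theta)/lambda)]^2.
    # np.sinc(u) computes sin(pi*u)/(pi*u), so we use u = (a/lambda) * sin(theta).
    u = a_over_lambda * np.sin(np.radians(theta_deg))
    return np.sinc(u) ** 2

angles = np.array([0.0, 10.0, 20.0, 40.0])
for ratio in (1, 3, 10):
    print(f"a = {ratio}*lambda:", np.round(relative_intensity(angles, ratio), 3))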
Figure 2.3: Graph of the intensity spread I/I0 as a function of θ (in degrees) for three different a/λ ratios (a = λ, a = 3λ, and a = 10λ).

2.3.4 Using Multiple Sensors
One way to circumvent the issue of poor operational conditions can be to employ a
variety of sensor types. In this way, when one sensor has poor conditions, another
may have better conditions. This could, for instance, be in a dark quiet environment. Here, the visual sensors would be impaired, but the sound sensors would
not. Additionally, under optimal conditions, the sensors could supplement each
other, making the robot better at sensing in general. In the case of utilizing cameras and sound sensors, this could mean that the sound sensors could detect sound
anomalies outside of the camera’s view, and the camera could detect anomalies not
making audible noise.
2.4 Alternative Solutions to assist Security Guards
This section will discuss solutions for security that can help assist security guards
or limit the amount of work required on-site.
2.4.1 A Camera-microphone solution
Using a set of cameras to monitor a perimeter is generally the most common
method. These camera feeds are usually either actively monitored, or stored for
reference, in the event of an intrusion. The cameras are occasionally equipped with
microphones to provide additional insight when detecting intruders. However, a
problem arises when the sound has to be reviewed; either a guard has to listen to
sounds from many different cameras one at a time with a high risk of missing the
anomalies or the sound has to be stored with the footage for later, making it, in
most cases, useless for catching the intruder in the act. It is possible to create an algorithm that searches all the camera microphones for anomalies at once, but anything short of a perfect algorithm will introduce false positives. Such an algorithm would allow the guard to listen only to the relevant camera microphones, but the cameras cannot investigate the sounds any further, as a robot solution could, so the guard will likely have to do that themselves. This takes the guard's attention away from the cameras, which is time the intruders can use to get past them.
2.4.2 A Mobile Robot Solution
A mobile robot can be equipped with a camera and a microphone array giving it
the same capabilities as a security camera with a microphone. The difference is
the ability of the robot to move around in its environment. This allows the robot
to investigate possible sound anomalies and even locate an intruder. The robot is
still unable to intercept an intruder meaning there is always a need to have at least
one security guard, but they will not have to waste their time investigating false
anomalies. To accomplish these tasks the robot will need a navigation program to
find its way around. One way to do this is to teach the path by manually guiding the robot along it, but there are many other options. It is important to limit the path
of the robot so that it only moves where it needs to. The robot should also move
in unpredictable ways to confuse or surprise the intruder and hopefully deter any
attempt to get past the security. Applying motion detection to the robot's camera feed is also a possibility, even while the robot is moving, meaning none of the abilities of a static camera are lost.
2.5 State of the art
When developing new solutions it is wise to look at what the market has to offer in
terms of existing products, to determine the state of the art. As such, this section
will look at state-of-the-art commercial products, concepts, and research to explore
the current market for existing patrolling robot solutions.
                  Argus S5 [23]             Nimbo [3]           Knightscope K5 [43]
Max Speed         4-6 km/h                  16 km/h             ca. 5 km/h
Dimensions        1750 x 780 x 1420 mm      660 mm x 580 mm     1587.5 x 850.9 x 914.4 mm
Weight            185 kg                    23 kg               ca. 180 kg
Usage             Outdoor (nighttime)       Indoor              Outdoor and Indoor
Route Generation  Pre-programmed            Pre-programmed      Pre-programmed
Navigation        Visual                    Visual              Visual
Sensors           Thermal, Panorama Camera  Camera              LiDAR, Sonar, GPS, Wheel

Table 2.1: Specifications table of state-of-the-art products
2.5.1 Commercial products
The current market for commercial patrol robots is relatively small. The biggest
firm on the market currently is SMP Robotics [15]. They offer a variety of patrolling robots, all with different functions and purposes. This section will compare some of the commercially available products on the market, to get a greater
understanding of what the market offers, as well as get an idea of key features and
performance metrics.
As seen in Table 2.1 the current products on the market are quite similar in almost
every category. However, the way they sense their surroundings is one of the major differences. The Argus S5 uses both thermal and panorama cameras to view its
surroundings, which makes it more efficient in its outdoor dark environment than
a regular camera would be. In contrast, the Nimbo, which is designed to patrol
indoor environments, only uses regular cameras made to detect visible light, as the
expected setting will likely be relatively well illuminated. One of the notable differences is the high maximum speed of the Nimbo robot, which is 3-4 times that of the others. This is likely due to the decreased weight of the robot. Keeping the robot lightweight means
that the motor has to be less powerful, making the robot cheaper. This decreased
price would also make the robotic solution more attractive for smaller companies.
• Knightscope
The Knightscope is an outdoor patrolling robot that is able to recharge itself autonomously. It has a max speed of 5 km/h and 360° vision. The robot is usually deployed in outdoor environments such as parking lots and malls. It is capable of detecting faces and raises an alert when detecting
known criminals. Companies that use the Knightscope have reported lower
crime rates in the areas where the robot is deployed. This is attributed to the
physical presence of the robot as a deterrent.
• Argus
The Argus is a patrol robot that is capable of visually detecting intruders.
Upon detection, the intruder is warned and the staff of the facility is informed
of the intruder’s presence and location. The robot has facial detection, which
allows it to differentiate a worker from an intruder. The robot can also operate under low light conditions, which means that the designated area can
also be surveilled during the evening and night hours. The Argus functions
optimally as part of a group of patrol robots, as this allows fewer blind spots.
• Nimbo
Nimbo is a multipurpose patrolling robot. In addition to the standard features that most patrolling robots have, the Nimbo is also capable of being
used as a hoverboard. Nimbo is usually deployed in indoor environments
such as warehouses, shopping centers, and educational facilities. It mostly
acts as a moving camera, though it is capable of sounding an alarm when
intruders are detected.
2.5.2 Research
A lot of the newest research on patrolling robots is focused on patrolling logic,
interaction with intruders, and randomization of the patrolling path. This section
will take a look at some chosen research papers and give a short summary of them
in order to gain insight into the current problems of patrolling robots, as
well as to generate ideas for future solutions.
• A Survey of Multi-robot Regular and Adversarial Patrolling
In this paper, the researchers took the problem of navigating a dynamic and
uncertain environment and tried to implement an algorithm that could be
used in real-time[17].
The algorithm used in the paper can be simplified into three steps:
1. The smallest possible rectangle that covers the boundary of the room is
set.
2. The rectangle is covered with the minimum number of circles, each sized to
encompass the sensor area of the robot.
3. A patrolling path is searched along the boundary of this set of circles in a
spiral; an example of this can be seen in Figure 2.4.
Figure 2.4: A figure of the coverage path[17]
The benefit of using a spiral, as seen in Figure 2.4, is that a spiral-shaped patrol minimizes the number of circles that are patrolled repeatedly. This solution can be scaled up to fit any room size and any sensor radius or function of the robot. In pseudo-code, the algorithm could look as follows (a small code sketch is given after the list):
1. Set the start point as the current point.
2. Mark all other points as unvisited.
3. Loop: find an unvisited neighboring point whose distance to the boundary is the smallest.
4. If no unvisited neighbor is found, mark the current point as visited and stop.
5. Otherwise, mark the current point as visited and set the current point to the neighboring point.
6. End of loop.
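The following is a minimal, illustrative sketch of that pseudo-code, assuming an axis-aligned rectangular room, a regular grid of circle centres spaced by the sensor radius, and "distance to the boundary" measured to the nearest wall. The function name and parameters are our own choices; the cited paper does not provide code.

import math

def coverage_path(width, height, sensor_radius, start=(0, 0)):
    # Grid of candidate circle centres (integer indices) covering the rectangle.
    nx = int(width // sensor_radius) + 1
    ny = int(height // sensor_radius) + 1
    unvisited = {(i, j) for i in range(nx) for j in range(ny)}

    def boundary_distance(cell):
        x, y = cell[0] * sensor_radius, cell[1] * sensor_radius
        return min(x, y, width - x, height - y)

    def unvisited_neighbours(cell):
        i, j = cell
        cand = [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)]
        return [c for c in cand if c in unvisited]

    current = start
    unvisited.discard(current)
    path = [current]
    while True:
        nbrs = unvisited_neighbours(current)
        if not nbrs:                                 # no unvisited neighbour: stop
            break
        current = min(nbrs, key=boundary_distance)   # greedily hug the boundary
        unvisited.discard(current)
        path.append(current)
    return [(i * sensor_radius, j * sensor_radius) for i, j in path]

if __name__ == "__main__":
    for x, y in coverage_path(width=6.0, height=4.0, sensor_radius=1.0):
        print(f"({x:.1f}, {y:.1f})")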
2.5.3 Path Planning using Reinforcement Learning and Neural Networks
When using a robot to patrol a large area, there are often specific points that must be inspected and surveyed [27]. The travel between these points is often through areas where surveillance is less necessary, meaning that minimizing these distances and travel times leads to a higher level of security.
Finding the optimal route between these points is simple when the number of points is low, but the underlying problem is nondeterministic polynomial (NP). NP refers to: "A decision problem (a problem that has a yes/no answer) is said to be in NP if it is solvable in polynomial time by a non-deterministic Turing machine. Equivalently, and more intuitively, a decision problem is in NP if, if the answer is yes, a proof can be verified by a Turing machine in polynomial time." The task becomes increasingly difficult as the patrolled area and the number of required points grow, since the number of possible routes increases drastically. To find the
optimal route for large facilities, companies have begun to use reinforcement
learning and neural networks to plan this path. This is done by representing the distances and times between points as costs and treating arrival at
the points as a reward. Using this method, the software iteratively finds a close-to-optimal path by minimizing the distance traveled and maximizing the time spent at the desired locations. This method has been shown to outperform other methods and can find a nearly optimal path with low computational expense for up to 100 different patrol points.
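To illustrate the cost/reward formulation described above, the sketch below runs tabular Q-learning over a small distance matrix to order a handful of patrol points. It is only a toy example under our own assumptions (state = current point plus the set of visited points, reward = negative travel distance); the cited work uses neural networks to scale to far larger instances.

import random

def learn_patrol_route(dist, episodes=5000, alpha=0.1, gamma=0.95, eps=0.2):
    n = len(dist)
    q = {}  # Q[(current_point, frozenset_of_visited)] -> list of values per next point

    def q_row(state):
        return q.setdefault(state, [0.0] * n)

    for _ in range(episodes):
        current, visited = 0, frozenset([0])
        while len(visited) < n:
            options = [p for p in range(n) if p not in visited]
            row = q_row((current, visited))
            nxt = random.choice(options) if random.random() < eps \
                else max(options, key=lambda p: row[p])
            reward = -dist[current][nxt]              # travel cost as negative reward
            new_visited = visited | {nxt}
            next_options = [p for p in range(n) if p not in new_visited]
            future = max((q_row((nxt, new_visited))[p] for p in next_options), default=0.0)
            row[nxt] += alpha * (reward + gamma * future - row[nxt])
            current, visited = nxt, new_visited

    # Greedy rollout of the learned policy.
    route, visited, current = [0], {0}, 0
    while len(visited) < n:
        options = [p for p in range(n) if p not in visited]
        row = q.get((current, frozenset(visited)), [0.0] * n)
        current = max(options, key=lambda p: row[p])
        visited.add(current)
        route.append(current)
    return route

if __name__ == "__main__":
    dist = [[0, 2, 9, 10],
            [2, 0, 6, 4],
            [9, 6, 0, 8],
            [10, 4, 8, 0]]
    print(learn_patrol_route(dist))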
2.5.4 State of the Art Conclusion
The research done on the existing commercial products has given an overview of
what has already been produced. This gives an idea of how saturated the market is
and also how well a similar product would do in terms of sales. This information
can be used to avoid redundant solutions that would likely result in a financial
loss. Furthermore, the specifications and functionalities of the existing solutions
also give an understanding of what holes in the market may be present. The final
takeaways from this section are related to the movement function of the robot.
Although this aspect of the robot will not be the main focus of this project, the
information can be taken into consideration when designing the robot, and for a
fully fledged patrol planning and possibly robot swarm system in the future.
2.6 Sensor Possibilities
For the solution to be able to detect intruders and traverse its surroundings, it
will need several sensors. These sensors can be divided into two groups: visual and audio.
2.6.1 Audio Sensors
Dynamic Microphones
Dynamic microphones use an induction coil in a magnetic field to record sound.
This recording method makes dynamic microphones cheap and durable; two things
that are attractive when designing a mobile, commercial solution [25].
Diaphragm Condenser Microphones
Diaphragm condenser microphones use a capacitor to convert vibrations into electrical current, making the microphone highly sensitive. This could be particularly
useful in cases where the source of a sound is quiet or distant, such as glass breaking or footsteps from the other side of a building. Many diaphragm condenser microphones also allow the user to choose the desired polar pattern. This allows for
the use of an omnidirectional polar pattern, giving the microphones the ability to listen in all directions (360°). However, the price of these microphones is higher than
that of a dynamic microphone [25].
2.6.2 Visual Sensors
Cameras used for image processing systems are usually categorized as either industrial/machine vision (MV) cameras or network/IP (Internet Protocol) cameras,
and both have their benefits and disadvantages.
Network Cameras
Network cameras are frequently used in surveillance applications and sometimes
in combination with industrial cameras. These are typically placed in robust casings designed to withstand harsh weather and jolts, making them suitable for outdoor and indoor use. They usually have a variety of day and night modes and
infrared filters that deliver high image quality consistently, even under poor lighting and weather conditions. These cameras compress the images they record to
reduce the volume of data being transmitted over the network. These cameras,
when connected to a network, can theoretically have an unlimited number of users
access the feed at the same time [5].
Industrial Cameras
Industrial cameras send raw, uncompressed data directly to the computer to which they are connected. This computer is then responsible for processing a large volume of incoming data. The benefit of this is that no image data is lost in compression. Industrial cameras are usually divided into two categories: line scan and
area scan cameras. These are relevant in different computer vision applications
and capture images differently.
Line scan Cameras
Line scan cameras use image capture sensors arranged in one or a few lines of pixels, where the image is captured line by line and finally assembled into a complete image in the processing stage. Line scan cameras are typically used
for scanning objects that move in front of the sensor, for example on a high-speed
conveyor belt. These cameras are used in many printing, packaging, and surface
inspection industries.
Area scan Cameras
Area scan cameras use rectangular image-capturing sensor arrangements, where
the entire image is captured simultaneously. These cameras are found in many
industries, such as medical, traffic, and security[5].
Colour Camera
Most color cameras work by having a single CMOS or CCD sensor overlaid with
colored filters that cover each of the pixels, making the pixels alternate between
being sensitive to red, green, and blue. The mosaic pattern typically used for this
is called the Bayer pattern. The resulting mosaic contains twice as many green
pixels compared to blue or red, because this mimics the human eye's greater sensitivity to green light. The Bayer pattern is illustrated in Figure 2.5.
Figure 2.5: The Bayer pattern, used in most color cameras[30].
This also means that if a red light hits a cell that is sensitive only to green light,
that information will be lost. Lost information can, through a variety of different algorithms, be interpolated from adjacent cells. This process of combining cell
information to make an image is called demosaicing [30]. This demosaicing conversion from Bayer pattern to RGB is very CPU intensive and is usually done by
the FPGA (Field Programmable Gate Array) of a frame-grabber instead, which is
an electronic component that can carry out this conversion. Single-sensor color
cameras have the advantage that the electronics are identical to a monochrome
camera, where only the sensor has to be modified with color filters, making them
very inexpensive and popular[19].
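As a minimal illustration of demosaicing in software (rather than on a frame-grabber FPGA), the sketch below converts a raw Bayer-pattern frame to a colour image with OpenCV. The file name, resolution, and the assumed BG Bayer layout are illustrative; the actual layout depends on the sensor.

import cv2
import numpy as np

raw = np.fromfile("frame.raw", dtype=np.uint8)   # hypothetical raw sensor dump
raw = raw.reshape((480, 640))                    # assumed sensor resolution
bgr = cv2.cvtColor(raw, cv2.COLOR_BayerBG2BGR)   # interpolate the missing colour values
cv2.imwrite("frame_rgb.png", bgr)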
Monochrome Camera
Monochrome cameras might not seem like the best type of camera for image recognition, as the images they output do not contain any color, only intensity gradients. This is, however, also the monochrome camera's benefit; while it cannot capture any color, it can capture all the light hitting the sensor, which results in a better-quality image with more detail. Furthermore, no demosaicing is needed to create the final image. Many image processing techniques also involve gray-scaling, which is the process of taking a color image and transforming it into an image consisting only of shades of gray. Using a monochrome camera would eliminate this process entirely. In addition to eliminating the need for a gray-scaling step, the image output of a monochrome camera is also significantly smaller in byte size than that of its color counterpart. Since the output images are smaller, they require less processing time. Processing time is an important factor to consider if the goal is to run image processing in real time. Monochrome cameras also have better low-light performance, since they are able to take in more light per photocell; this is a benefit, as there is a chance that the robot will be deployed in a poorly lit or unlit environment[12].
Thermographic Camera
A thermographic or infrared camera, as the name implies, is a camera that creates an image using infrared radiation. This means that it is able to see heat signatures. Thermographic cameras are usually very expensive compared to their
non-thermographic counterparts. A reason for using a thermographic camera is
that they are able to see in total darkness. As it is possible that the robot might
be deployed in a non-lit environment, a thermographic camera might be the only
camera type capable of detecting a human intruder.
On Image Size and Processing Speed
As mentioned above, images are often grayscaled in image processing applications
as a method of reducing the amount of data. This is advantageous since it decreases
the processing time of the image.
Another way of decreasing the processing time is to decrease the image quality.
In some applications, high image resolution is not needed, meaning the image
resolution can be decreased without a significant drop in program functionality.
In other applications, a high frame rate may not be needed. If each frame has to
be analyzed, the frame rate is a key factor in the maximum permissible processing
time. Thus, decreasing the frame rate will increase the amount of time that can be
used for image processing.
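The sketch below combines the three data-reduction steps mentioned above (gray-scaling, lowering the resolution, and lowering the effective frame rate) for a video feed, assuming OpenCV video capture. The input file, scale factor, and frame-skip value are illustrative choices, not values used by the actual system.

import cv2

cap = cv2.VideoCapture("patrol_feed.avi")   # hypothetical recorded feed
frame_skip = 3                               # analyze only every 3rd frame
count = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    count += 1
    if count % frame_skip:
        continue
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # drop colour information
    small = cv2.resize(gray, None, fx=0.5, fy=0.5)    # halve the resolution
    # ...motion detection or other image processing on `small` would go here...
cap.release()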
2.7 Subconclusion & Problem Formulation
There are many reasons to use alternatives to either supplement or partly replace
security guards in property security. This alternative would have to be capable
of locating potential intruders to be effective. The different alternatives to using
security guards have their advantages and disadvantages. In this project, a mobile robot platform is chosen as the security solution, due to its increased flexibility. Additionally, this project will focus on the perception
system of this solution, as this is a more complex matter. From this, the following
problem formulation is formed:
How can a perception system be made to detect humans
for a mobile robotic platform?
The system will utilize both sound and visual sensors, since this has been deemed a better approach according to section 2.3.4. When specifically considering the sound sensors, these will focus on detecting sound anomalies (sounds of
interest) from the environment.
In this report, a prototype will be constructed and validated by a use case described
in chapter 7. The results of this validation will be discussed in Chapter 8. Here, the
solution will be compared to both current robotic solutions, along with the use of a
standard human security team, to determine the validity of the proposed product.
Chapter 3
Requirements
This chapter will outline the requirements of the solution, which will be the groundwork for the actual system design. All these requirements will be design requirements intended to showcase the desired functionalities of the system. These will
later be addressed and converted into functional requirements with measurable
success criteria.
1.1 ISO Compliance: The robotic system shall be in compliance with all relevant ISO standards.
1.2 Autonomous Navigation: The robot must be able to navigate its designated known environment without human interference. This will include navigating between some predetermined points.
1.3 Obstacle Avoidance: The robot must avoid and navigate around 95% of the objects in its chosen path.
1.4 Positional Awareness: When traveling in a known environment, the robot must know its current approximate position.

Table 3.1: General requirements of the mobile robot platform
The general system requirements of the robotic mobile platform are outlined in
table 3.1. These requirements will not be directly addressed in this report, but they
do affect the requirements of the perception system.
The requirements of the perception system are outlined in table 3.2. These requirements are the main groundwork for the solution concept. These requirements will
be further addressed in Section 4.4.2, which will outline success criteria based on
the chosen solution. Anomalies are defined in section 5.1.
2.1 Anomaly Detection: When an environmental anomaly occurs within range (of the robot's sensors), the robot must detect the anomaly.
2.2 Anomaly Classification: When an anomaly is detected, the robot must identify if the anomaly is a human.

Table 3.2: Requirements for the perception system
Chapter 4
Design Concept
This chapter will cover the possible design options for the microphone array and camera setup. Only microphones and cameras will be considered as relevant possibilities, as no other reasonable sensor types have been
identified. Additionally, the possibility of using multiple sensor types will be considered. Lastly, the chosen designs will be combined into a holistic design concept
in section 4.4.
4.1 Camera Setups
This section will cover single camera setups and camera arrays. These setups can
be combined with any of the camera types mentioned in section 2.6.2.
Single Camera
The single camera is the simplest setup; its main benefits are ease of use and setup. No calibration is needed, as there is only one input in this setup.
Another advantage is decreased production cost. There are, however, some limits
when using a single camera, such as the resolution and frame rate being limited to
what the camera is able to output directly. However, as mentioned in section 2.6.2,
this quality may not be needed. Using a single camera also limits the field of view
of the optical system to that of the single camera. This can be detrimental when
used in solutions where the surroundings of the robot are of concern. However,
many stationary surveillance cameras use lenses, like the fisheye lens, to increase
their field of view to be able to cover additional areas.
Figure 4.1: An example of the increased field of view that a fisheye lens provides [31].
Camera array
A camera array is a collection of cameras calibrated to produce a single image.
Usually, individual cameras are of lower cost, but with calibration and software
processing, it is possible to combine the lower-quality images into high-quality images or, alternatively, a picture of a larger field of view than a single camera could
achieve. This can be very useful if there is a need for an image of all directions in a
given environment. Though the setup of the camera array is relatively simple, the
calibration and software processing of the array is out of scope for a project such
as this, due to its complexity.
Figure 4.2: An example of an image captured on a camera array, spliced from multiple images.[44]
4.2 Sound localization
Sound localization is a field within signal processing that deals with identifying
the origin of a detected audio signal, with respect to an array of microphones[32].
The ability to estimate the direction of a sound is vital to many biological organisms, where it serves as an alert to dangers and predators or, in predators, is used
to locate prey. Sound localization also has many different engineering applications
and has become a large and complex field, in which humans attempt to recreate artificially that which the animal kingdom has perfected[36]. Sound localization is an
important field that has seen many different applications such as sound source separation, sound tracking, and speech enhancement technologies. In robotics, it can
be useful to be able to determine the direction of and distance to a sound source,
especially in social- or security robotics. This section will explore and analyze
the advantages and disadvantages of some existing sound localization methods to
determine which could prove the most viable in a mobile security robot.
Basic Principles
Typically, sound localization in electronic systems is done by using two or more
microphones in an array and using the difference in arrival times of a sound at the
two microphones to determine the direction of arrival (DOA). This time difference
is called the interaural time difference[32]. The accuracy of a microphone array’s
ability to determine direction is fundamentally limited by the physical size of the
array. If the microphones in the array are too closely placed together, the interaural
time difference will be near zero, making mathematical estimation of direction
extremely difficult. It is not uncommon for the microphones in such arrays to be placed 10-30 centimeters apart, which has consequences for the size of the array[36]. Physically large arrays can become impractical to use on small robots, and even for large robots, such microphone arrays can be inconvenient to mount and maneuver. Large separation between microphones is required to detect the low-frequency content of audio signals, but small distances are required to avoid spatial aliasing. This poses another challenge when designing microphone arrays, as the spacing between microphones is not arbitrary. Spatial aliasing will be further elaborated upon in section 5.2.5. The precision of sound localization using microphone
arrays has been found to increase with the use of more microphones, which in turn
increases the cost of the array[32]. This is an example of some of the problems that
are encountered in the physical setups of sound localization microphone arrays.
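As a minimal illustration of the arrival-time-difference principle, the sketch below estimates the delay between two microphone signals with a full cross-correlation; the peak of the correlation gives the sample lag. The simulated pulse, sampling rate, and delay are made up for the example and are not measurements from the actual system.

import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    # Full cross-correlation between the two signals; the index of its peak,
    # relative to the zero-lag position, gives the delay in samples.
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)
    return lag / fs   # arrival-time difference (t_a - t_b) in seconds

if __name__ == "__main__":
    fs = 48000
    t = np.arange(0, 0.05, 1 / fs)
    pulse = np.sin(2 * np.pi * 1000 * t) * np.exp(-200 * t)
    delay = 12                                         # simulated extra travel, in samples
    mic_a = np.concatenate([pulse, np.zeros(delay)])   # sound reaches mic A first
    mic_b = np.concatenate([np.zeros(delay), pulse])   # ...and mic B 12 samples later
    print(estimate_tdoa(mic_a, mic_b, fs))             # ≈ -12 / 48000 = -0.00025 s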
Monoaural Localization
Monoaural localization refers to the use of a single "ear" or microphone to determine the direction of a sound source. As mentioned previously, sound localization
in artificial systems is typically done by using two or more microphones. In contrast, being able to use a single microphone holds the potential to decrease both
the size and cost of a microphone array significantly[36]. Sound localization with
a single microphone, however, is very inaccurate and complex because it requires
prior knowledge of possible sounds, and in a narrow mathematical sense, it is
actually impossible to determine the direction of sounds with a monoaural recording alone[36]. To combat this, monoaural microphone arrays typically use artificial
pinnae or auricles, which refer to the outer part of the ear in animals, as can be seen in figure 4.3.
Figure 4.3: An illustration showing how sound from different directions is affected by an artificial
pinna structure [46].
The pinna is able to change the way a known sound is perceived and change
the spectral shape of the sound depending on the direction it is coming from.
Humans are automatically trained to recognize this change throughout their lives
and become better at sound localization of known sounds through exposure, but
recreating this artificially is very difficult. Some studies have suggested using machine learning to train an algorithm to be able to do this reliably with relative
success[36]. This sound localization setup, however, has proven to have some challenges[37]. Firstly, the use of an artificial pinna means that there exists a range
of angles from which the algorithm has difficulty estimating the direction of the
source of the sound. In a Stanford University test [37], this problematic angle
ranged from 235◦ to 345◦ , which constitutes nearly a third of the possible sound
directions. Additionally, the average error of the experiments ranged from 4.3◦ for
wideband noise-like signals to 18.3◦ for naturally occurring sounds such as dog
barks.
Assuming that the audio source is constant and stationary, it is possible to perform monoaural audio localization by moving the microphone array. By introducing movement to the microphone array, it is possible to emulate having multiple
microphones in the array, because the changing position yields readings from different positions relative to the audio source. Mathematically, assuming accurate position estimation, this is indistinguishable from the sound source localization arithmetic used for estimating direction in multi-microphone arrays, except that the travel time between samples must be compensated for.
Figure 4.4: Example of monoaural localization with movement
Binaural Localization
Binaural literally means "having or relating to two ears" and binaural localization
refers to sound localization by using two microphones. Binaural localization primarily uses interaural time differences as a cue for sound localization. This is a
phenomenon used in most mammals to determine sound direction in the azimuth
plane. However, this binaural cue cannot be used to determine the elevation of a
sound source as it suffers from front/back ambiguity[36]. A sound source placed
directly in front of a binaural microphone array is indistinguishable from a sound
source placed directly behind the array, as the interaural time difference is zero
in both cases. Much like the monoaural microphone arrays, binaural microphone
arrays can use artificial pinna to distort and reflect known sounds depending on
the sound source’s direction, making it possible to more accurately determine the
direction of the sound source in three-dimensional space. This is done using a
Head-Related Transfer Function which also takes into consideration the shape of
the head when calculating the time it takes the sound to reach the furthest microphone [26].
The ambiguity of this solution can be avoided by using more than two microphones. However, more microphones come with more complex algorithms and a
need for larger computational capacity. By using more than two microphones, while still keeping their number to a minimum, it is possible to reduce the complexity of both the algorithm and the computation [32]. Tests of binaural microphone arrays were made using three different methods.
Multi-microphone array
Multi-microphone arrays consist of three or more microphones, generally for the purpose of increasing accuracy and removing the need for pinnae.
This does however come with an increased cost and complexity; both arising from
the increased number of microphones. While adding more microphones generally
will add a higher level of accuracy, it will also increase the costs of the system,
both directly, through having to buy more microphones, and indirectly, through
needing more processing power and more complex code[22].
A multi-microphone array uses the same fundamental principle for sound localization as a binaural microphone array, using the interaural time difference[22].
To determine the direction of a sound in 3D space, the minimum required number of microphones is four. These microphones should be set up as the vertices of a tetrahedron. This setup eliminates front-back ambiguity across all planes. If the elevation of the sound is not needed, sound localization can be achieved using only three microphones forming a triangle parallel to the azimuth plane. Some high-end microphone arrays use ultradirective microphones
arranged on a sphere, which allows for a very robust sound localization setup.
One of these types is the Eigenmike®, which is a spherical 32-microphone array
that is able to detect and isolate multiple sound sources with selective hearing.
4.3 On the use of multiple sensors
The main reason for using multiple sensors is that it allows an anomaly detected by one sensor to be confirmed by another. When only using audio, there is a significant chance of false positives when detecting sound anomalies that are not intruders. Additionally, there is no way to locate an intruder if the intruder is not making enough sound. When only using video, the surveyable area is limited by the camera's field of view. As such, the camera has
difficulty covering a large area simultaneously. However, when using video and
audio in conjunction, the severity of these problems is reduced. A solution using
this multisensor method can use one detection method as a confirmation of the
anomaly detected by the other.
4.4 Concept Selection

4.4.1 Microphone Selection
Because of the inherent inaccuracy of audio source localization using a single-microphone setup, this option can be discarded as unviable. The binaural microphone array requires the use of pinnae, which is reason to discard this option as well, due to the needlessly increased complexity. Based on the information available, a 3-microphone array has been determined to be the most suitable solution for the implementation needs of this project. Fewer microphones mean less space and fewer resources required, along with decreased complexity. Using a 3-microphone setup also removes the main issue with a 2-microphone setup, namely that calculations using the TDOA method give two potential anomaly source locations. It should be noted that using only three microphones will reduce the accuracy of the system compared to a system using 8 microphones [29]. Three microphones also limit sound source detection to the azimuth plane, meaning the system cannot determine whether the sound is above or below the microphone setup.
In this project, the three microphones are positioned on the points of an equilateral
triangle to ensure equal sensitivity to all angles in the azimuth plane.
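Once a time difference has been estimated for a microphone pair, the far-field (plane-wave) approximation sin(θ) = c·Δt/d converts it into a direction of arrival. The sketch below shows this conversion for a single pair; the spacing and speed of sound are illustrative values, and the full three-microphone geometry is not covered here.

import math

def doa_from_tdoa(tdoa, mic_spacing, speed_of_sound=343.0):
    # Clamp to [-1, 1] so measurement noise cannot push asin out of its domain.
    s = max(-1.0, min(1.0, speed_of_sound * tdoa / mic_spacing))
    return math.degrees(math.asin(s))      # angle relative to the array broadside

print(doa_from_tdoa(tdoa=0.00025, mic_spacing=0.2))   # ≈ 25.4 degrees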
Camera Selection
For the camera selection, there are many options. In motion detection and object recognition, webcams or Kinect cameras are usually used; however, the expected methodologies (explained in Section 5.3) only require the use of an RGB camera. This means that the larger financial investment of a Kinect camera array is unnecessary. The lower cost of an RGB camera gives the robotic solution a decreased
production cost, making the solution more financially viable, and a more realistic
option for smaller companies with less financial flexibility.
4.4.2 Functional Requirements
In this section, functional requirements will be outlined based on the requirements of Chapter 3 (see Table 4.1). These requirements follow the format of an identifying number, a title, and a description. The description also states the original requirement upon which the functional requirement is based.
3.1 Precision of Sound Anomaly Detection: When a sound anomaly occurs within 5 meters of the perception system, the robot must detect the anomaly with 90% precision. Based on Requirement 2.1.

3.2 Recall of Sound Anomaly Detection: When a sound anomaly occurs within 5 meters of the perception system, the robot must detect the anomaly with 90% recall. Based on Requirement 2.1.

3.3 Sound Anomaly Identification: When a sound anomaly is detected, the perception system must identify the direction of the sound with a deviation within half the FOV of the camera minus 10° in the azimuth plane. This should ensure that any movement is in line of sight of the camera. Using this deviation, the sound source localization must be able to identify the anomaly angle within the deviation 90% of the time. Based on Requirement 2.1.

3.4 Human Identification in well-lit environment: When a human is seen on the camera feed, the perception system must correctly identify the human 95% of the time in a well-lit environment. Based on Requirement 2.2.

3.5 Human Identification in poorly-lit environment: When a human is seen on the camera feed, the perception system must correctly identify the human 80% of the time in a poorly-lit environment. Based on Requirement 2.2.

Table 4.1: Functional Requirements of the robotic system
Chapter 5
Methodology
This chapter will cover different methods for sound anomaly detection, sound source localization using one, two, or multiple microphones, and motion detection. It also states which methods were chosen for anomaly detection, sound source localization, and motion detection in this project.
5.1 Anomaly Detection
Anomalies are defined as something that deviates from what is normal or expected,
which in relation to this report can be sounds of windows breaking, forced entry,
or video feeds of suspected burglars. Automatic video and audio analysis can detect anomalous patterns in surveillance feeds; this is called anomaly detection[28].
Anomaly detection is very useful, as it can serve as a pre-alarm or a signal to security personnel that they should monitor a certain camera feed. This significantly
increases the amount of surveillance a single person can perform[35].
5.2 Audio Analysis
Audio analysis in particular has emerged as a relevant tool for improving the security of public and private assets. In fact, in many cases, the analysis of audio
signals from microphones in a surveilled area deployed to detect anomalous audio
signatures has been proven to be more reliable than the video analysis counterpart
of the same area[10].
Audio analysis refers to the extraction of information and meaning from audio
signals for analysis, classification, and storage. Audio analysis extracts data that
represents analog sounds in digital form, preserving the main properties of the original sound. Sounds have three key characteristics to consider when analyzing: time period, amplitude, and frequency. Audio signals are most commonly represented visually as a waveform, a spectrum, or a spectrogram, as seen in Figures 5.1, 5.2, and 5.3.
Waveform
The waveform is the most common way of representing sound and is encountered
in most recording software. The waveform representation is a graph that maps the
amplitude of a sound over time.
Figure 5.1: A visual representation of the waveform of a signal[24].
Spectrum
A spectrum representation shows the frequency content of a sound signal, where
the frequency is represented along the x-axis and the amplitude of the signal is on
the y-axis. Natural sounds contain a wide range of different frequencies. Tonal
sounds contain a fundamental frequency and a range of overtones that are multiples of the fundamental frequency. Usually, it isn't possible to hear the individual overtones, as the fundamental frequency and the overtones combine; their individual amplitudes and the relationship between them play an important role in the perceived tone color, or timbre, of the tone. The tone color is what makes it possible to distinguish between the same tone originating from different sources. These fundamental tones and overtones are visualized in the spectrum representation in Figure 5.2.
Figure 5.2: The spectrum of a sound signal, with fundamental frequency and overtones marked[24].
Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies as it varies
over time. Spectrograms are also known as sonographs or voiceprints. Spectrograms are used extensively in the field of audio analysis with many different applications. As seen in Figure 5.3, the spectrogram is usually represented as a heat
map, where the intensity is shown by varying colors in the image.
Figure 5.3: A spectrogram of a sound signal[24].
Visual representations are rarely sufficient to extract meaningful information
about an audio signal, and a numerical approach is necessary. In almost any audio
analysis, the Fourier transform plays a large role.
5.2.1 Fourier Transform in Audio Analysis
The Fourier transform (FT) is a mathematical transform that decomposes a function into its frequency components, producing an output that is a function of frequency. It typically maps a signal from the time or space domain to the frequency domain and vice versa. This is useful in many types of signal processing, as it allows one to isolate certain frequencies in a signal to suppress, enhance, or analyze them. An example of this particular application of the Fourier transform is a signal with high-pitched noise: the high pitches can be isolated and suppressed, and the inverse Fourier transform can then reproduce the signal without the noise.
In short, the Fourier transform makes it possible to view a signal as the sum of
several pure sine waves of different frequencies and amplitudes. An example of
the Fourier transform output can be seen in Figure 5.4.
Figure 5.4: Visualisation of the output of the Fourier transform[6].
One of the common conventions for defining the Fourier transform of some integrable function f : ℝ → ℂ is the following:

\[
\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-i 2\pi \xi t}\, dt, \qquad \forall\, \xi \in \mathbb{R}
\]
The Fourier transform works by mathematically winding the graph of f around the origin of a Cartesian coordinate system at a variable frequency ξ, referred to as the winding frequency. The wound graph is assigned a point denoting its center of mass. When the winding frequency approaches the frequency of f, or the frequency of one of the components of f, the center of mass becomes noticeably displaced from the origin along the x-axis. The displacement of this point from the origin is mapped to another graph as a function of the winding frequency; this function is the output of the Fourier transform. In figure 5.5, the original signal
with frequency 3 can be seen in yellow, and the corresponding windings at different winding frequencies can be seen below. The x-coordinate position of the
center of mass can be seen as a function of winding frequency in the red graph.
It can be seen that the displacement is significant at exactly the frequency of the
signal. This is what makes the Fourier transform able to decompose signals into
their components. This also works for composite signals, where the displacement
peaks would be found at the frequencies of the components.
Figure 5.5: Visualization of how the Fourier Transform works[2].
A digital computer cannot work with continuous-time signals directly, so it is necessary to take samples and analyze these instead of the original signal. This yields a discrete sequence of samples, sampled at a frequency chosen according to the Nyquist-Shannon theorem (Section 5.2.3). The Discrete Fourier Transform (DFT) is the discrete version of the Fourier transform; it transforms a discrete sequence, such as a sequence of samples, from the time-domain representation to the frequency-domain representation. In practice, the DFT is usually computed with the Fast Fourier Transform (FFT), which is any efficient algorithm for computing the DFT or its inverse. A fundamental flaw of the discrete Fourier transform is that it is computationally intensive, as it requires O(n²) computations. The fast Fourier transform, however, uses clever mathematics to reduce this to O(n log n) computations, meaning that a Fourier computation that would have taken over 3 years with the plain DFT could be done in 35 minutes with the FFT.
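As a brief illustration of this decomposition (not part of the project's code), the following NumPy sketch builds a signal from two sine components with arbitrarily chosen frequencies and recovers them with the FFT:

```python
# Illustrative sketch: decomposing a composite signal with the FFT (NumPy).
# The component frequencies and amplitudes below are arbitrary examples.
import numpy as np

fs = 1000                                   # sampling rate [Hz]
t = np.arange(fs) / fs                      # one second of samples
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)              # FFT of the real-valued signal
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
magnitude = np.abs(spectrum) / len(signal)  # normalized amplitude per bin

# The two largest peaks sit at the component frequencies (50 Hz and 120 Hz)
peaks = freqs[np.argsort(magnitude)[-2:]]
print(np.sort(peaks))                       # [ 50. 120.]
```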
5.2.2 Using The Fourier Transform To Detect Sound Anomalies
To be detected, a sound anomaly needs to deviate from normal sounds. This is where the Fourier transform is useful, since it allows analysis of each individual frequency in a sample of sound. To detect a sound anomaly, a short sound sample is recorded and a Fourier transform is performed on it to split it up into its frequency components and their respective amplitudes. To then determine whether the recorded sample contains unusual sounds, a sound template has to be created that the sample can be compared to. One way of creating the template is to record a large number of background noise samples, perform Fourier transforms on them, and then save the largest value found at each frequency. The template will then consist of the loudest background noise observed at each frequency. This means that it can also be used in places where there is somewhat loud background noise while still being able to detect quieter sound anomalies.
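A minimal sketch of this template approach is shown below; the sample length, threshold factor and function names are illustrative assumptions rather than the project's actual implementation:

```python
# Sketch of the noise-template approach: keep the loudest background
# magnitude seen at each frequency, then flag samples that exceed it.
# Sample length and threshold factor are illustrative assumptions.
import numpy as np

FS = 48_000            # sampling rate [Hz]
N = 2 * FS             # two-second samples

def build_template(background_samples):
    """background_samples: iterable of length-N arrays of background noise."""
    template = np.zeros(N // 2 + 1)
    for sample in background_samples:
        template = np.maximum(template, np.abs(np.fft.rfft(sample)))
    return template    # loudest observed magnitude per frequency bin

def is_anomaly(sample, template, factor=1.3):
    """True if any frequency bin is louder than `factor` times the template."""
    return np.any(np.abs(np.fft.rfft(sample)) > factor * template)
```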
5.2.3 Nyquist-Shannon Sampling Theorem
Audio signals are continuous-time (analog) signals, which can be stored on computers in the form of discrete, equidistant points, called samples, as a function of discrete time or space. The higher the sampling rate, the higher the accuracy of the reconstructed signal and the stored information. However, high sampling rates generate large volumes of data to be stored and processed, which can require a lot of computational power to handle. Musical audio signals, for instance, are rich in high frequencies and require high sampling rates upwards of 44,100 samples per second[9], while other signals require only a fraction of this. Therefore, it is crucial
to select an appropriate sampling rate for a given signal in order to record sufficient information about the signal to recreate and analyze it. The question then
arises: What is the minimum necessary sampling frequency for a given type of signal that allows for accurate reconstruction and preservation of data? The answer
is provided by the Nyquist-Shannon sampling theorem, which states that:
"The minimum sampling frequency of a signal, so that it will not distort its underlying information, should be double the frequency of its
highest frequency component."[9]
Suppose that X(t) is a band-limited signal. Band-limited means that for the Fourier transform of this signal, \(\hat{X}(f) = \mathcal{F}\{X(t)\}\), there exists a certain \(f_{\max}\) for which

\[
|\hat{X}(f)| = 0 \qquad \forall\, |f| > f_{\max}
\]
so that there is no power in the signal beyond the maximum frequency \(f_{\max}\). The Nyquist theorem then states that to sample this signal, it would be necessary to sample with a frequency larger than or equal to twice the maximum frequency contained in the signal, that is:

\[
f_{\text{sample}} \geq 2 f_{\max}
\]
If this is the case, no information is lost during the sampling process, and the
original signal could theoretically be reconstructed from the sampled signal.
5.2.4 Aliasing
Aliasing is the effect that happens when different signals appear similar when sampled. It can also occur when the reconstructed signal from the samples is different
from the original continuous signal. An example of aliasing can be seen in figure
5.6.
Figure 5.6: An example of aliasing[9]
The blue wave is the signal being sampled, and the red bars mark where the signal is sampled. The green sine wave in Figure 5.6 is the wave reconstructed from the sampled points; the green wave is aliased. This occurs when the sampling frequency is less than two times the highest frequency of the sampled signal. As is obvious from the image, the green wave is not an accurate representation of the blue wave, which illustrates the importance of a sufficiently high sampling frequency for obtaining an accurate representation of the sampled signal.
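As a small numerical illustration (not taken from the report), sampling a tone above half the sampling frequency makes it reappear at a lower, aliased frequency:

```python
# Small aliasing demonstration: a 15 kHz tone sampled at only 16 kHz
# shows up in the spectrum as a 1 kHz alias (16 kHz - 15 kHz).
import numpy as np

fs = 16_000                                  # sampling rate [Hz] (too low)
f0 = 15_000                                  # true tone frequency [Hz]
t = np.arange(fs) / fs                       # one second of samples
samples = np.sin(2 * np.pi * f0 * t)

freqs = np.fft.rfftfreq(len(samples), d=1 / fs)
peak = freqs[np.argmax(np.abs(np.fft.rfft(samples)))]
print(peak)                                  # 1000.0, not 15000.0
```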
5.2.5 Spatial Aliasing
Spatial aliasing is a type of aliasing. It can, for example, be a problem when trying to locate a sound source using a microphone array; it occurs when the distance p between microphones in a linear setup is phase-aligned with the sound source, which happens if the wavelength of the sound source equals p. This can lead to direction ambiguity and, as such, must be addressed. The ambiguity can be addressed in one of two ways[48]; either of these two conditions needs to be met:

\[
2p < \lambda \tag{5.1}
\]

or

\[
f < \frac{v}{2p} \tag{5.2}
\]

where v is the speed of sound, λ is the wavelength of the sound, and f is the frequency of the sound.

From Equation 5.2 it is evident that, in a linear microphone setup, the maximum frequency the array can handle without ambiguity is f < v/(2p). So while spatial aliasing might not pose a problem, it is something to be aware of.
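As a brief worked example with illustrative values (not the project's array geometry), taking v = 343 m/s and a hypothetical linear spacing of p = 0.1 m, Equation 5.2 gives

\[
f < \frac{343\ \mathrm{m/s}}{2 \cdot 0.1\ \mathrm{m}} \approx 1715\ \mathrm{Hz}
\]

so such a pair of microphones could only resolve directions unambiguously for frequencies below roughly 1.7 kHz.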
5.2.6 Sampling Rate
The human ear has a best-case frequency range from 20 Hz to 20 kHz and it is very
likely that the majority of anomalous sound events can be captured entirely in this
spectrum. As stated by the Nyquist-Shannon theorem, the minimum sampling
frequency of a signal, that does not distort or lose its underlying information,
should be double the frequency of its highest frequency component. As the highest
frequency component in this spectrum is 20 kHz, it is necessary to sample at a
frequency of at least 40 kHz in order to be able to record and reconstruct the
signal. Music, for instance, is typically sampled at 44.1 kHz and recorded at 48 kHz
to leave room for the anti-aliasing filter that is used in analog-to-digital converters.
Because of this, the microphone array for this project will be sampled at 48 kHz,
which is supported by the field recorder mentioned in section 6.1.2.
5.3 Motion Detection
Motion detection is the process of detecting and tracking the movement of objects
or persons, in relation to their surroundings, in a video feed. There exists a plethora
of different methods for motion detection, and this section will outline three of the
most common methods, after which one will be chosen for use in this project.
• Background Subtraction:
Background subtraction works by comparing frames and subtracting one frame from another. Usually, a background frame is chosen, and all subsequent frames are compared to it. Compared, in this respect, means taking the absolute value of the difference between the other frame and the background frame; the absolute value is taken to avoid integer underflow, as an unsigned 8-bit integer cannot go below 0. This results in an image where the differences between the background frame and the other frames are obvious. This frame can then be thresholded to make the differences more evident. The thresholded image can then be used to draw bounding boxes around the changes between the background frame and the other frames. Though it is very simple to implement, background subtraction has its drawbacks, such as being excessively sensitive to scene changes such as lighting or other foreign events. There exist background subtraction methods that can combat these problems; they will however not be covered here.[38]
• Temporal differencing:
Temporal differencing shares some of its process with background subtraction but is still different. The main difference between the two methods is that in temporal differencing the current frame is compared with the previous frame. This makes temporal differencing better suited for a non-stationary camera, as it doesn't rely on a predetermined background for finding motion. Temporal differencing is very robust in dynamic environments, although it can suffer from poor performance in extracting all relevant feature pixels (i.e. pixels that include the features that should be identified) of the object of interest; usually, techniques such as morphological operations and hole filling are used to rectify these problems.[38]
• Optical flow:
The optical flow method can be used to detect moving objects and even their
direction and velocity. Simplified, the general technique works by tracking
pixels over a short span of time and then calculating the image derivative
and drawing a vector on it. This vector contains the direction and velocity
of the object[33]. It should be mentioned that optical flow motion detection
is computationally intensive, while at the same time being very sensitive to light and scene changes. This makes it poorly suited to a moving robot[38].
Conclusion
Based on the previous section, background subtraction has been chosen for motion detection. This is because it is easy to implement and the process is very well documented, whereas optical flow is too computationally intensive to implement on a TurtleBot. Background subtraction has its own challenges, such as its sensitivity to light and environmental changes, so while not a perfect candidate, it is the best of the three mentioned methods.
A more in-depth method for doing background subtraction

Image and video processing follow very similar procedures, the main difference being that an image is only a single frame while video processing works on a stream of images. When doing any sort of image processing with regard to motion detection, it is common practice to grayscale the input, as this can cut the processing time to a third. After grayscaling the image, it should be blurred with a Gaussian filter to smooth it. The blurring averages the pixel intensities, which is important as it smooths out high-frequency noise that could interfere with motion detection. This high-frequency noise often originates directly from the camera input; dark regions of an image usually contain most of it. The images are then compared by selecting a "first frame" and comparing it with the subsequent frames. This is done using image subtraction and thresholding to reveal regions of significant change in pixel values. A diagram of the process can be seen in Figure 5.7.
Figure 5.7: A diagram of the background subtraction process [16]
Using contour detection, it is possible to find the outlines of these regions in the thresholded image. Lastly, for easy visualization, a bounding box is drawn around the area of motion. This works well for stationary cameras, but as the robot will be moving, traditional background subtraction is not sufficient. However, there are techniques for adapting background subtraction to moving cameras. One technique is to compensate for the global motion of the camera as if the camera were stationary. This can be done using block matching. In general, block matching works by first dividing the frames of the video into blocks; the algorithm then tries to match these to the previous frame. If the algorithm finds a match, a vector is created that maps the block onto the matching block in the previous frame. In theory, this vector could be used to compensate for the global motion of the camera.
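The stationary-camera pipeline described above can be sketched with OpenCV as follows; the threshold value, blur kernel size and minimum contour area are illustrative choices, not the project's tuned parameters:

```python
# Minimal sketch of the background-subtraction pipeline described above
# (grayscale -> Gaussian blur -> absolute difference -> threshold ->
# contours -> bounding boxes). Parameter values are illustrative.
import cv2

cap = cv2.VideoCapture(0)           # webcam feed
background = None                   # "first frame" used as background

while True:
    ok, frame = cap.read()
    if not ok:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale to cut processing time
    gray = cv2.GaussianBlur(gray, (21, 21), 0)       # smooth high-frequency noise

    if background is None:          # first frame becomes the background
        background = gray
        continue

    # Absolute difference avoids unsigned-integer underflow
    diff = cv2.absdiff(background, gray)
    _, thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    thresh = cv2.dilate(thresh, None, iterations=2)  # fill small holes

    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) < 500:                 # ignore tiny regions
            continue
        x, y, w, h = cv2.boundingRect(c)             # bounding box around motion
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow("Motion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```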
5.4 Auditory localization
With regard to auditory localization, some assumptions first need to be made about the spreading of sound waves, also known as the propagation of sound, to simplify the complex nature of the subject. There are two choices to be made: firstly, near field versus far field, and secondly, free field versus diffuse field. One of each should be chosen, and the combination will simplify the complex matter of sound enough to perform calculations on it.
In the near field, it is assumed that the sound source is very close to the microphone, within one wavelength. Within this distance, the sound waves are complex
because they circulate back and forth, never escaping. This means there is no fixed
relation between pressure and distance. In the far field, the sound is assumed to be further away, between one wavelength and infinity. When the sound source is this far away from the microphone array, it is safe to treat the wavefront as planar, perpendicular to the direction of propagation [40].
The difference between free field and diffuse field is that free field assumes that
there is nothing around to reflect the sound waves back, which simplifies the calculations because there are fewer waves to account for. The diffuse field assumes
that there are walls around to reflect the sound back to the listener multiple times,
making it appear like there is no single sound source [40]. Once you have assumed either near or far field and free or diffuse field, you can choose which method to use to determine the Direction of Arrival (DOA). The most common
way of finding the DOA of sound is using the time difference of arrival (TDOA)
between a number of microphones.
TDOA is often used because of its ability to be applied to broadband signals (wide
frequency range), but also because of its accuracy and simplicity, which means that
it uses very little computational power. For these reasons, the TDOA method was
chosen to find the DOA in this project.
There are different methods of using the TDOA to find the DOA; the two most common are triangulation and steered response power (SRP). Triangulation uses the geometry of the microphone positions to calculate the DOA, while SRP is a beamforming-based method. Due to the simplicity and low computational requirements of triangulation, it was chosen for this project.
5.4.1 Determining TDOA
To determine the time difference of arrival (TDOA) of audio signals recorded from different microphones with a short relative physical displacement, it is common to use audio cross-correlation (or cross-covariance). Cross-correlation consists of the displaced dot product of two signals, and it is most commonly used to quantify the degree of similarity between two signals; in other words, it is used to compare two signals. As the sample index n of the correlator is incremented, the output of the correlator is a similarity score that compares the two signals at different time shifts. This produces two important results: how much one signal resembles the other at any given time shift, and at which time shift the peak similarity occurs. The value of the cross-correlation will be maximal when the signals are time-aligned, resulting in the greatest amount of overlap. For signals evaluated in discrete time, the correlation between two signals x and y of the same length N samples is expressed by the following expression:
\[
\mathrm{Corr}\{x, y\}[n] = \sum_{m=1}^{N} x[m] \cdot y[m+n]
\]
Once cross-correlation has been performed on two sound signals, it yields a graph of similarities at different time shifts. The index of the largest peak in the cross-correlation is usually very close, or equal, to the sampling rate of the microphones. This index is then used in the following equation:

\[
\mathrm{TDOA} = \frac{\mathrm{IndexOfMax} - \mathrm{SamplingRate}}{\mathrm{SamplingRate}} \tag{5.3}
\]

which yields a number, in seconds, that is usually very small and equal to the TDOA.[4]
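A minimal NumPy sketch of this idea is shown below. The synthetic broadband signal and the 0.5 ms delay are illustrative; the zero-lag index of NumPy's full cross-correlation (len(x) − 1) plays the role of the SamplingRate offset in Equation 5.3 for equally long recordings:

```python
# Minimal sketch of TDOA estimation via cross-correlation with NumPy.
# The signal and the 0.5 ms delay are synthetic, for illustration only.
import numpy as np

fs = 48_000                       # sampling rate [Hz]
d = 24                            # true delay in samples (0.5 ms at 48 kHz)

rng = np.random.default_rng(0)
s = rng.standard_normal(fs + 100)             # broadband "anomaly" sound
x = s[100:]                                   # reference microphone
y = s[100 - d:len(s) - d]                     # same sound arriving d samples later

corr = np.correlate(y, x, mode="full")        # similarity at every time shift
zero_lag = len(x) - 1                         # index corresponding to zero shift
tdoa = (np.argmax(corr) - zero_lag) / fs      # peak shift converted to seconds

print(f"estimated TDOA: {tdoa * 1e3:.3f} ms") # prints ~0.500 ms
```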
5.5 Farfield DOA proof
The positions of the microphones are set up in an equilateral triangle. The microphones’ positions P1 , P2 , P3 are defined in equation 5.4 and are shown graphically
in figure 5.8a.
" #
0
P1 =
,
0
#
"
− sin( π6 )
,
(5.4)
P2 = D
− cos( π6 )
"
#
sin( π6 )
P3 = D
− cos( π6 )
where D is the edge length of the equilateral triangle.
The sound wave is assumed to be from a large enough distance, so that it can be
considered as a straight line or a wall, with direction ⃗v:
"
#
sin( φ)
⃗v =
− cos( φ)
The graphic representation of ⃗v can be seen in figure 5.8b.
From here, a line representing the sound wave has to be found. Firstly, we assume that the sound wave hits P1 at time t = 0. Since the direction \(\vec{v}\) is a normal vector to the wavefront line, we can compute the wavefront line, l, as:
\[
l: \; \sin(\varphi)\cdot(x - 0) - \cos(\varphi)\cdot(y - 0) = 0
\;\;\Leftrightarrow\;\;
l: \; x\sin(\varphi) - y\cos(\varphi) = 0
\tag{5.5}
\]
This is illustrated in figure 5.9a.
In order to figure out the times at which the sound wave will hit P2 and P3 respectively, we have to know the distance between the sound wave and the points. The
distance between a line l : ax + by + c = 0 and a point P( x0 , y0 ) can be calculated
as:
\[
\mathrm{dist}(P, l) = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}
\]
(a) Positions of P1 , P2 , P3
(b) Microphone positions with ⃗v
Figure 5.8: The positions of microphones in an equilateral triangle (a), and the vector indicating
sound wave direction (b).
Inputting the values from Equations 5.4 and 5.5, we can find the distances dist(P2, l) and dist(P3, l) as:

\begin{align}
\mathrm{dist}(P_2, l) &= \frac{\left|-\sin(\varphi)\cdot D\sin(\frac{\pi}{6}) + \cos(\varphi)\cdot D\cos(\frac{\pi}{6}) + 0\right|}{\sqrt{\cos^2(\varphi) + \sin^2(\varphi)}} \tag{5.6} \\
&= D\,\left|-\sin(\varphi)\sin\!\left(\tfrac{\pi}{6}\right) + \cos(\varphi)\cos\!\left(\tfrac{\pi}{6}\right)\right| \tag{5.7} \\
&= D\,\left|\cos\!\left(\varphi + \tfrac{\pi}{6}\right)\right| \tag{5.8}
\end{align}

and

\begin{align}
\mathrm{dist}(P_3, l) &= \frac{\left|\sin(\varphi)\cdot D\sin(\frac{\pi}{6}) + \cos(\varphi)\cdot D\cos(\frac{\pi}{6}) + 0\right|}{\sqrt{\cos^2(\varphi) + \sin^2(\varphi)}} \tag{5.9} \\
&= D\,\left|\sin(\varphi)\sin\!\left(\tfrac{\pi}{6}\right) + \cos(\varphi)\cos\!\left(\tfrac{\pi}{6}\right)\right| \tag{5.10} \\
&= D\,\left|\cos\!\left(\varphi - \tfrac{\pi}{6}\right)\right| \tag{5.11}
\end{align}

These distances are illustrated in figure 5.9b.
The last important step when calculating the distance to the sound wave is to
figure out whether or not the point is "behind" or "ahead of" the wave (i.e. whether
the point has already been hit at time t = 0 or not). For this the vector projection
formula can be used:
\[
\vec{a}_{\vec{b}} = \frac{\vec{a}\cdot\vec{b}}{\|\vec{b}\|^2}\,\vec{b}
\]
(a) The soundwave at t = 0
(b) Distance from soundwave to P2 and P3
Figure 5.9: A blue line indicating the sound wave (a), and the distances from the sound wave to the
points P2 and P3 as red dotted lines (b).
What can be gathered from this equation is that the factor \(\frac{\vec{a}\cdot\vec{b}}{\|\vec{b}\|^2}\) determines whether \(\vec{a}\) is "behind" or "ahead of" \(\vec{b}\). If the factor is less than 0, \(\vec{a}\) is "behind" \(\vec{b}\), and vice versa. As a mathematical representation, the distance will be negative when "behind" \(\vec{v}\) and positive otherwise.
Applying this to the vectors P2 and P3 and \(\vec{v}\), it can be seen that:

\[
\frac{P_2 \cdot \vec{v}}{\|\vec{v}\|^2} = D \cos\!\left(\varphi + \frac{\pi}{6}\right) \tag{5.12}
\]

and

\[
\frac{P_3 \cdot \vec{v}}{\|\vec{v}\|^2} = D \cos\!\left(\varphi - \frac{\pi}{6}\right) \tag{5.13}
\]
Comparing equations 5.6 and 5.12, and equations 5.9 and 5.13, we can gather that:

\[
\frac{P_i \cdot \vec{v}}{\|\vec{v}\|^2} = \mathrm{dist}(P_i, l), \qquad i = 2, 3 \tag{5.14}
\]
It can also be gathered that the signed distances d2 and d3 are:

\[
d_i =
\begin{cases}
-\mathrm{dist}(P_i, l), & \dfrac{P_i \cdot \vec{v}}{\|\vec{v}\|^2} < 0 \\[2ex]
\;\;\,\mathrm{dist}(P_i, l), & \dfrac{P_i \cdot \vec{v}}{\|\vec{v}\|^2} \geq 0
\end{cases}
\qquad i = 2, 3
\tag{5.15}
\]

Using the information from Equations 5.14 and 5.15, it can be seen that:

\[
d_2 = D \cos\!\left(\varphi + \frac{\pi}{6}\right)
\]
and

\[
d_3 = D \cos\!\left(\varphi - \frac{\pi}{6}\right)
\]

Using this and assuming that the sound wave moves at speed c, it can be seen that the TDOA of P1 to P2, t12, is:

\[
t_{12} = \frac{D}{c} \cos\!\left(\varphi + \frac{\pi}{6}\right) \tag{5.16}
\]

and the TDOA of P1 to P3, t13, is:

\[
t_{13} = \frac{D}{c} \cos\!\left(\varphi - \frac{\pi}{6}\right) \tag{5.17}
\]

Using these two, the TDOA of P2 to P3 can be calculated as:

\[
t_{23} = t_{13} - t_{12} = \frac{D}{c} \cos\!\left(\varphi - \frac{\pi}{6}\right) - \frac{D}{c} \cos\!\left(\varphi + \frac{\pi}{6}\right) = \frac{D}{c} \sin(\varphi) \tag{5.18}
\]
It is important to note that the goal is to find the angle the sound wave came from and not the angle it is going. If we define a vector \(\vec{w}\) as a unit vector pointing towards the sound source, it will have an angle θ from the positive y-axis (see figure 5.10).
Figure 5.10: Direction towards the sound source, \(\vec{w}\)
Here it can be clearly seen that θ = φ.
Assuming that θ ∈ (−π; π], it can be seen that Equations 5.16, 5.17 and 5.18 will always admit multiple solutions, since for a sine wave:

\[
\sin(a) = \sin(\pi - a), \qquad \sin(a) = \sin(2\pi + a)
\]
From this, it can be gathered that, for a given TDOA between mics 2 and 3, t23, there would be two possible angles α1 and α2, of which one is equal to θ. α1 can be calculated as:

\[
\alpha_1 = \arcsin\!\left(\frac{c \cdot t_{23}}{D}\right) \tag{5.19}
\]
Using the sine identities, it can be seen that:

\[
\alpha_2 =
\begin{cases}
\;\;\,\pi - \alpha_1, & \alpha_1 \geq 0 \\
-\pi - \alpha_1, & \alpha_1 < 0
\end{cases}
\]
Using this, two possible values for θ are found. These values can be inserted into
equations 5.16 and 5.17. This will yield some expected TDOA values for t12 and
t13 . The correct angle can be found by comparing these two to the actual TDOA
values. This angle will henceforth be denoted θ23, as it has been derived from t23; it is used as the reference angle θ1 below.
While this may be enough in theory, the TDOAs may be inaccurate, and therefore
the angle should be based on all TDOAs instead of only getting the angle from a
single TDOA.
In order to calculate an average angle, angles have to be calculated for both t12 and
t13 . Below is an example showing how it would be done for t12 .
Firstly the equation for t12 is rewritten:
\[
t_{12} = \frac{D}{c}\cos\!\left(\varphi + \frac{\pi}{6}\right)
\;\;\Leftrightarrow\;\;
\arccos\!\left(\frac{c}{D}\, t_{12}\right) - \frac{\pi}{6} = \varphi
\tag{5.20}
\]
To account for there being two solutions to the original equation, it can be found
that another solution is:
\[
\varphi = -\arccos\!\left(\frac{c}{D}\, t_{12}\right) + \frac{11\pi}{6}
\]
or

\[
\varphi = -\arccos\!\left(\frac{c}{D}\, t_{12}\right) - \frac{\pi}{6}
\]

When these angles are found, the one closest to θ1 (the reference angle) is deemed the correct one. This angle distance takes into account that angles lie on a circle, and thus it is possible that the shortest distance crosses the line where the angle wraps from −π to π. This is illustrated in figure 5.11.
Figure 5.11: Example of angle distance being shorter over the line where the angle crosses from −π
to π. As clearly shown, the distance d1 is shorter.
The correct angle gathered from t12 is called θ2 . Using a similar method, θ3 can be
gathered from t13 .
The sound DOA, θ, is then deemed to be the average of the angles θ1, θ2, and θ3, corrected for rotations (such that θ ∈ (−π; π]).
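The derivation above can be condensed into a short numerical routine. The sketch below is illustrative rather than the project's actual code: the spacing D and speed of sound c are example values, and the final averaging is implemented here as a circular mean of the three angle estimates:

```python
# Illustrative sketch of the far-field DOA estimation from Section 5.5,
# given measured TDOAs t12, t13, t23 in seconds. D and C are example values.
import numpy as np

C = 343.0      # speed of sound [m/s]
D = 0.08       # microphone spacing [m]

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return -((-a + np.pi) % (2 * np.pi) - np.pi)

def ang_dist(a, b):
    """Shortest distance between two angles on the circle."""
    return abs(wrap(a - b))

def estimate_doa(t12, t13, t23):
    # Step 1: two candidate angles from t23 (Equation 5.19)
    a1 = np.arcsin(np.clip(C * t23 / D, -1.0, 1.0))
    a2 = np.pi - a1 if a1 >= 0 else -np.pi - a1

    # Step 2: keep the candidate whose predicted t12/t13 (Eqs. 5.16, 5.17)
    # best match the measured values -> reference angle theta1
    def mismatch(phi):
        return (abs(D / C * np.cos(phi + np.pi / 6) - t12)
                + abs(D / C * np.cos(phi - np.pi / 6) - t13))
    theta1 = min((a1, a2), key=mismatch)

    # Step 3: resolve the two arccos solutions for t12 and t13 by choosing
    # the one closest (on the circle) to the reference angle
    base12 = np.arccos(np.clip(C * t12 / D, -1.0, 1.0))
    theta2 = min((wrap(base12 - np.pi / 6), wrap(-base12 - np.pi / 6)),
                 key=lambda p: ang_dist(p, theta1))
    base13 = np.arccos(np.clip(C * t13 / D, -1.0, 1.0))
    theta3 = min((wrap(base13 + np.pi / 6), wrap(-base13 + np.pi / 6)),
                 key=lambda p: ang_dist(p, theta1))

    # Step 4: average the three estimates as unit vectors (circular mean)
    angles = np.array([theta1, theta2, theta3])
    return wrap(np.arctan2(np.sin(angles).mean(), np.cos(angles).mean()))

# Example: TDOAs generated from a true angle of 40 degrees
phi_true = np.deg2rad(40)
t12 = D / C * np.cos(phi_true + np.pi / 6)
t13 = D / C * np.cos(phi_true - np.pi / 6)
print(np.rad2deg(estimate_doa(t12, t13, t13 - t12)))   # approximately 40
```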
Chapter 6
Implementation
This chapter will first describe each hardware component used in the project. It
will then explain each software component along with how they were integrated
into ROS (Robot Operating System).
6.1 Hardware
This section will cover the hardware that will be used in the downscaled prototype.
The prototype will be based on a TurtleBot2, which is an open-source, low-cost, yet
powerful robot kit based on the iClebo Kobuki mobile robot base. The Turtlebot
comes equipped with different useful sensors and libraries.
6.1.1 Onboard Hardware
The TurtleBot2 is a vertically stacked robot based on the iClebo Kobuki mobile robot base. The base features odometry sensors with 52 ticks per encoder revolution, a gyroscope, as well as bumpers, cliff sensors, and wheel drop sensors.[34]
Cliff Sensors
The Kobuki robot base is equipped with three cliff sensors on the underside of
the base. These are used to detect when the robot is approaching a steep dropoff in the operating environment. This enables the robot to detect and act on this
type of environmental obstacle before incapacitating or damaging itself. These are
installed in the front of the robot and on either side[47].
Figure 6.1: The iClebo Kobuki mobile robot base without a tower[47].
Bumpers
Much like the cliff sensors, the Kobuki robot base comes with three bumper sensors installed on the perimeter of the base. These are also used to detect and navigate around obstacles in the operating environment: primarily for collision detection, but they can contextually be used for obstacle avoidance [47].
Encoders
The robot has another very useful component: rotary encoders. Rotary encoders are devices used in a wide variety of applications that require motion monitoring or control. In this case, the encoders are attached to the driving motors of the robot and provide information about the motion of the drive shaft, which is processed into information about the position, speed, and driving distance of the robot base. There exist many different types of encoders, but they all serve the same purpose.
6.1.2 Additional Hardware
Microphones
Determining the DOA using the TDOA method without artificial pinnae requires at least three microphones. The microphones chosen for this project are Behringer ECM8000 measurement microphones; one can be seen in Figure 6.2. These microphones are omnidirectional, meaning they pick up sound equally from every angle, which is ideal for anomaly detection and for finding the DOA from any position in space. They are also condenser microphones, which are well-suited for this project; condenser microphones are explained in Section 2.6.1. They additionally have an ultra-flat frequency response, meaning the microphones capture sound at close to an equal level for all frequencies. The frequency response of a microphone is shown in a graph with the signal's frequency on the x-axis and the signal's dB level on the y-axis; the flatter the graph, the less the microphone will warp or alter the raw recording.
Figure 6.2: The Ultra-linear Behringer ECM8000 Measurement Microphone[7].
3D-printed microphone stand
To ensure precise and repeatable positioning of the microphones in the equilateral
triangle that is required by the TDOA math, a microphone stand was drawn in
Solidworks and 3D-printed using PLA. The microphone stand secures the bottom
of the microphones in place while allowing an XLR cord to be inserted from the
bottom. A sort of lid with holes is then slid on top of the neck of the microphones
and secured to the rest of the stand. The lid holds the necks of the microphones in
place so that they are all precisely 8 cm apart. A foot was also 3D-printed so that it
could be placed on top of the robot without sliding. Two pictures of the stand can
be seen in Figure 6.3.
(a) (L) Microphone stand foot, (C) Lid, (R) Main body. (b) Picture of the microphones inserted into the microphone stand.
Figure 6.3: Pictures of the 3D-printed microphone stand.

Zoom F6 Field Recorder

The Zoom F6 is a portable field recorder designed for professional audio recording in a variety of settings. It features six inputs, each with its own high-quality preamp and individually adjustable gain, making it well-suited for recording interviews, music performances, and sound effects in the field. The F6 has a 32-bit
floating point resolution and a sampling rate of up to 192 kHz, allowing it to capture a wide range of frequencies and dynamics with high accuracy. For this project,
the field recorder has been limited to a sampling rate of 48 kHz, which is a more
than sufficient sampling rate for high-quality audio recordings. It also has built-in
timecode functionality, making it easy to synchronize audio with video recordings.
One of the standout features of the Zoom F6 is its ability to record in both mono
and polyphonic modes. In mono mode, all six inputs are combined into a single
mono track, making it ideal for recording a single source or for mixing multiple
sources together. In polyphonic mode, each input is recorded as a separate track,
allowing for greater flexibility in post-production and use in audio source localization.
Figure 6.4: The Zoom F6 Field Recorder[45].
Logitech C905 720p Webcam
The webcam used for this project is the Logitech C905, which is a compact 2-megapixel portable webcam capable of 720p video feeds at 30 frames per second and able to capture 8-megapixel photos. It uses high-precision Carl Zeiss optics, AutoFocus, and light-correcting RightLight™ 2 technology to improve image quality for optimal video streaming. It is equipped with a built-in microphone, which will not be used in this project. The webcam uses a Hi-Speed USB 2.0 certified connector to connect to a computer.
Figure 6.5: The Logitech C905 720p webcam used for motion detection [1]
6.2 Motion detection program
The motion detection program has been written in Python, partly because of the many widely available open-source libraries and Python's general ease of use. The main library used for the motion detection program is OpenCV, which is one of the largest open-source computer vision libraries and has many useful functionalities that help to streamline the programming process. The source code of the program can be found in the GitHub repository linked in the preface. There exists a variety of methods for doing motion detection, each with its own strengths and drawbacks. "Background Subtraction" was found to be the most suitable candidate for this project, mainly due to its ease of implementation and its capabilities. A brief text-based walkthrough of the code can be found in the last paragraph of Section 5.3. A simplified flowchart of the program can be seen in Figure 6.6.
Figure 6.6: Flowchart of the Motion Detection Program
6.3 Anomaly Detection, TDOA and sound source locating program
As with the motion detection program, the anomaly detection, TDOA, and sound source localization program is also written in Python. The main libraries used in this program are NumPy, to calculate the fast Fourier transforms and the cross-correlation, and sounddevice, to perform the recordings.
The program begins by recording a two-second sound sample from all three microphones. Each microphone is numbered according to the proof from Section 5.5, and the recordings from the microphones are similarly named rec1, rec2, and rec3. After the three simultaneous recordings have been made, an FFT is performed on each sound sample and then compared to a noise template that has been created earlier. How a noise template is made is explained in Section 5.2.2. The program checks, for each recording, whether the FFT contains spikes at frequencies that lie above the noise template. If that is the case, the microphone will have heard an anomaly. If only one or two microphones hear anomalies, the program will start over and record three new samples, but if all three microphones hear anomalies it will instead continue to try and locate the source. To locate the sound source, the program finds the TDOAs between all three microphones using cross-correlation, as explained in Section 5.4.1. It performs cross-correlation between all possible pairings of the recordings. This gives three TDOAs named t12, t23, and t13. From these three TDOAs, a DOA is calculated in accordance with the algorithm laid out in Section 5.5.
A simplified flowchart of the program can be seen in Figure 6.7.
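A condensed sketch of this loop is shown below. The template file name, channel ordering and threshold factor are assumptions, and the localization step is only indicated; it is not the project's actual program:

```python
# Illustrative sketch of the detection loop: record from three channels,
# FFT each channel, compare against a stored noise template and only
# continue to localization when all three channels exceed it.
import numpy as np
import sounddevice as sd

FS = 48_000                 # sampling rate [Hz]
DURATION = 2                # seconds per sample
template = np.load("noise_template.npy")   # max background magnitude per bin
                                            # (built from samples of the same length)

def is_anomaly(signal, factor=2.0):
    """True if any frequency bin exceeds the noise template by `factor`."""
    return np.any(np.abs(np.fft.rfft(signal)) > factor * template)

while True:
    # Record DURATION seconds from three input channels simultaneously
    rec = sd.rec(int(FS * DURATION), samplerate=FS, channels=3)
    sd.wait()
    rec1, rec2, rec3 = rec[:, 0], rec[:, 1], rec[:, 2]

    if all(is_anomaly(r) for r in (rec1, rec2, rec3)):
        # All microphones heard the anomaly: cross-correlate the pairs to
        # get t12, t13, t23 and compute the DOA (see Sections 5.4.1 and 5.5)
        ...
    # otherwise: record three new samples and try again
```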
6.4 ROS
ROS (Robot Operating System) is an open-source framework designed for developing robotic applications [49]. It is a middleware solution that connects software
and hardware in robotic solutions. It has a modular structure consisting of, among
others, topics and nodes, simplifying the process of working with different sets of
pre-built hardware. Additionally, the open-source nature of ROS ensures a variety
of pre-built libraries and packages for commonly used components and robots. It
also has a wide range of support for different operating systems including Linux,
Windows, and macOS, as well as supporting the use of a variety of programming
languages, like C++ and Python. The latest release of ROS is ROS 2 - Humble
Hawksbill. This project, however, uses ROS Noetic Ninjemys for Ubuntu 20.04.
Figure 6.7: Flowchart of the Anomaly detection, TDOA, and sound source locating program
This section intends to describe, in general terms, how ROS works through nodes and topics.
6.4.1 Topics
Topics in ROS essentially work the same way a variable does in a regular program. It holds some information that may be used or changed by some nodes.
This information is stored as a certain type; in ROS, these types are called messages. Messages can be compared to the variable types of a regular programming
language, however, with increased versatility. These can range from a simple Int8,
which includes a single 8-bit integer value, to more complicated messages, like the
Imu message type, which includes information about the robot’s angular orientation, velocity, and acceleration in all three dimensions.
Topics are useful since they work as a middle ground for software to interact with
one another. By utilizing topics, programs no longer have to interact directly with
one another, but can simply manipulate and read the same data.
6.4.2 Nodes
In ROS, nodes are the processes that perform the actual computation of the robot.
They can range from very simple to very complex, depending on the use. There
are two types of nodes: subscriber nodes and publisher nodes. Subscriber nodes
are subscribed to a topic, meaning they perform some operation whenever a topic
changes. Publisher nodes ’publish’ to a topic, meaning they update a topic at a
certain predetermined rate. A node can be subscribed to multiple topics, publish to
multiple topics, and simultaneously subscribe to one topic and publish to another.
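As a minimal illustration of these two node types in ROS Noetic (rospy), the sketch below uses an invented /example/flag topic of type Bool; it is not one of the project's nodes:

```python
# Minimal sketch of a publisher and a subscriber node in ROS Noetic (rospy).
# Topic and node names are illustrative, not the project's actual names.
import rospy
from std_msgs.msg import Bool

def publisher_node():
    """Publishes a Bool message to a topic at a fixed rate."""
    rospy.init_node("example_publisher")
    pub = rospy.Publisher("/example/flag", Bool, queue_size=1)
    rate = rospy.Rate(10)                 # 10 Hz publishing rate
    while not rospy.is_shutdown():
        pub.publish(Bool(data=True))      # message of type Bool
        rate.sleep()

def callback(msg):
    """Runs every time a new message arrives on the subscribed topic."""
    rospy.loginfo("received: %s", msg.data)

def subscriber_node():
    rospy.init_node("example_subscriber")
    rospy.Subscriber("/example/flag", Bool, callback)
    rospy.spin()                          # keep the node alive
```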
6.4.3 Project ROS implementation
This project consists of 3 custom nodes and 3 custom topics. Additionally, it utilizes
the prebuilt packages for the Turtlebot 2, including both nodes and topics. When
running, the turtlebot_bringup/minimalbringup.launch file is called in order for
the program to work.
A graph of the relations between topics and nodes can be seen in figure 6.8.
The motion_detecter node
This node uses the code described in section 6.2. The node works independently
from the rest of the nodes. It reads information directly from the webcam described
in section 6.1.2, and publishes to the /motion_detection/motion_detected topic.
This topic is of the type Bool and is true if there is motion, and false if there is not.
This is implemented in order to have a value that is easy to use in any other node.
Due to the modular nature of ROS, it is possible to run multiple instances of the
motion_detecter node. This means that another camera can easily be added to the setup for increased coverage.
Figure 6.8: Graph of ROS nodes (using the rqt_graph function). This can be found in a bigger version
in appendix A.
The anomaly_detection_and_tdoa node
This node uses the code described in section 6.3. Whenever the node calculates a
goal angle, it publishes to the /localization_topics/goal_angle topic. It does
not publish at a specific rate, as the program is deemed slow enough to not be a
problem. The program is estimated to publish at a rate of approx. 0.4-0.5 Hz. When
the robot is moving, the node does not publish to the /localization_topics/goal_angle
topic, sleeping for 100 ms repeatedly, until the robot has stopped moving.
The /localization_topics/goal_angle topic is of the type Float32 and represents the angle, θ, where θ ∈ (−π; π ].
Running multiple instances of this node is not possible as the code is presented in this report.
The driving_controller node
When an angle has been published to the /localization_topics/goal_angle topic, it is read by the driving_controller node. This node publishes an angular velocity to the /cmd_vel_mux/input/navi topic in order to control the rotation of the Turtlebot. The velocity is controlled using a proportional controller, proportional to the angular distance to the goal angle. This is a fairly simple solution that allows the robot to arrive at its target angle with high accuracy. Since it is practically impossible to arrive at the exact angle desired, the robot has a tolerance of 0.01 angle units, corresponding to 1.8° or approx. 0.0314 rad.
Whenever the robot is moving, it publishes to the /localization_topics/moving topic. This topic is of the type Bool and represents that the robot is in motion. This topic has been implemented since it is simpler than checking all velocity values of the robot.
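A sketch of such a proportional rotation controller is shown below. It reuses the topic names stated above, but assumes the goal angle is a heading in the odometry frame read from the standard /odom topic; the gain, loop rate and tolerance are illustrative and the sketch does not reproduce the project's actual node:

```python
# Illustrative sketch of a proportional rotation controller in the spirit of
# the driving_controller node. Gain, rate and tolerance are example values.
import math
import rospy
from std_msgs.msg import Float32, Bool
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Twist

KP = 1.0                    # proportional gain (illustrative)
TOLERANCE = 0.0314          # rad, roughly 1.8 degrees

class DrivingController:
    def __init__(self):
        rospy.init_node("driving_controller_sketch")
        self.goal = None
        self.yaw = 0.0
        self.cmd_pub = rospy.Publisher("/cmd_vel_mux/input/navi", Twist, queue_size=1)
        self.moving_pub = rospy.Publisher("/localization_topics/moving", Bool, queue_size=1)
        rospy.Subscriber("/localization_topics/goal_angle", Float32, self.goal_cb)
        rospy.Subscriber("/odom", Odometry, self.odom_cb)

    def goal_cb(self, msg):
        self.goal = msg.data

    def odom_cb(self, msg):
        q = msg.pose.pose.orientation       # extract yaw from the quaternion
        self.yaw = math.atan2(2 * (q.w * q.z + q.x * q.y),
                              1 - 2 * (q.y ** 2 + q.z ** 2))

    def spin(self):
        rate = rospy.Rate(20)
        while not rospy.is_shutdown():
            if self.goal is not None:
                # shortest angular distance to the goal, wrapped to (-pi, pi]
                error = math.atan2(math.sin(self.goal - self.yaw),
                                   math.cos(self.goal - self.yaw))
                cmd = Twist()
                if abs(error) > TOLERANCE:
                    cmd.angular.z = KP * error      # velocity proportional to error
                    self.moving_pub.publish(Bool(data=True))
                else:
                    self.goal = None                # target reached, stop rotating
                    self.moving_pub.publish(Bool(data=False))
                self.cmd_pub.publish(cmd)
            rate.sleep()

if __name__ == "__main__":
    DrivingController().spin()
```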
Chapter 7
Verification and Use Case Validation
This chapter aims to document the testing and verification process of the motion
detection, sound anomaly detection, and sound source localization software.
7.1 Resource limitations
The allotted resources for this project limit the prototyping abilities; however, a
concept prototype can be constructed to test the audio and visual localization capabilities of the chosen concepts. This concept prototype will be a Turtlebot 2 robot,
with a webcam and an array of diaphragm condenser microphones. The testing of
the robot will occur in a sound lab at Aalborg University, where the limits of its
abilities will be tested.
7.2 Motion detection program Testing

7.2.1 The test cases
To test the limits of the motion detection program, a series of tests will be performed for the following scenarios:
• Control
• Multiple moving objects
• Non-humanoid-shaped objects
• Partial obstruction of the path
• Decreased/increased camera quality
• Varying distances (1 m, 3 m, 5 m, 10 m, 20 m)
• Varying light levels
• Varying backgrounds
These tests will be performed by moving a human-shaped object across the field of
view of the camera, and recording how many times it correctly detects the object.
If not otherwise specified, the test is performed at a distance of 3 meters and a
brightness of around 190 lux.
7.2.2 Results of motion detection tests
Each of the tests was performed 5 times. The test was executed with small variations.
• Multiple moving objects
This test was designed to test the program’s capacity to detect multiple moving objects in a single scene. The test was done using a person moving their
arms, while another person in the foreground moved from one side of the
frame to another. In all five tests, the program correctly identified the anomalies, with no false positives or false negatives.
• Non-humanoid-shaped objects
This test was designed to test the program’s capacity to detect objects of
different sizes and colors. The test was done using different items of varying
sizes ranging from a box of size 10x10x10 cm to an average-sized human. In
all five tests, the program correctly identified the anomalies, with no false
positives or false negatives.
• Partial obstruction of the path
This test was designed to test the program's capacity to detect objects of different sizes and colors while the path is partially obstructed. The test was done using different items of varying
sizes ranging from a black box 10x10x10 cm to an average-sized human. In
all five tests, the program correctly identified the anomalies, with no false
positives or false negatives.
• Decreased camera quality
This test was designed to test the program’s capacity to detect objects across
different camera qualities. The camera quality was artificially lowered, by
blurring the input image with a Gaussian filter. The tests started with normal camera input, and the amount of blurring was then gradually increased over the four other tests. In
all five tests, the program correctly identified the anomalies, with no false
positives or false negatives.
• Varying distances (1 m, 3 m, 5 m, 10 m, 20 m)
This test was designed to test the program’s capacity to detect objects across
different distances. First one meter and then increasing as the tests go on.
The program succeeded in all test cases at all distances.
• Testing at varying light levels and distances: In this test, the program was
tested at the distances 1 m, 3 m, 5 m, 10 m, and 20 m. Additionally, the
distances 1 m, 3 m, and 5 m were tested at different light levels. These light
levels were:
– Well lit:
Fully lit by either full daylight or multiple light sources.
– Slightly lit:
A single light source is present.
– Poorly lit:
No light sources are lit, but light is present from an adjacent room (behind the camera).
– Dark:
No light source is present.
A table with the distance- and light-level tests can be found in Table 7.1. In this test, the lux level was measured at the point of movement for each distance and light level. Another table, Table 7.2, shows the rest of the test results.
            Well lit          Slightly lit      Poorly lit        Dark
Distance    TP  FP  FN        TP  FP  FN        TP  FP  FN        TP  FP  FN
1 m         15  0   0         15  0   0         15  0   0         0   0   15
  Lux level 30                11                8.9               0
3 m         15  0   0         15  0   0         10  0   5         0   0   15
  Lux level 17                7.8               3.9               0
5 m         15  0   0         15  0   0         0   0   15        0   0   15
  Lux level 13                4.1               0                 0
10 m        5   0   0         -                 -                 -
  Lux level 212               -                 -                 -
20 m        5   0   0         -                 -                 -
  Lux level 181               -                 -                 -

Table 7.1: Table of motion detection at different distances and light levels

Test type                          TP   FP   FN   Precision   Recall
Control                            5    0    0    100%        100%
Multiple objects                   5    0    0    100%        100%
Non-humanoid-shaped objects        5    0    0    100%        100%
Partial obstruction of the path    5    0    0    100%        100%
Decreased camera quality           5    0    0    100%        100%
Varying Backgrounds                5    0    0    100%        100%

Table 7.2: A table of the visual test results

7.2.3 Visual Test Conclusion

From these results, it can be concluded that the visual detection function of the robot solution is not limited by partial obstructions, distances up to 20 meters, or the introduction of multiple and non-humanoid anomalies. As seen in Table 7.2, no cases of either false positives or false negatives have been observed in any well-lit environment. This can be used to conclude that the program can be considered
reliable when used in well-lit environments. This is to be expected, as background subtraction works best in a static, well-lit environment. However, the perfect results might also be a consequence of insufficient stress testing of the program, so the results should be viewed with some skepticism.
However, as seen in Table 7.1, the motion detection performance decreases when
the environment is poorly lit. This is mainly due to the hardware limitations of the
camera, not a limitation of the program. The algorithm can only detect what the
camera outputs and if the camera outputs a black image it is not possible for the
algorithm to detect anything.
Possible solutions to this problem will be explored in chapter 8.
7.3 Audio Testing

7.3.1 Assumptions
Prior to the experiment, a number of assumptions need to be made:
1. The speed of sound is set to 343 m/sec and fluctuations in speed due to
pressure and temperature differences are ignored.
2. Only one sound source is present.
3. The sound is emitted omnidirectionally.
4. The anomaly being detected is on the same horizontal plane as the robot.
Using these assumptions a list of auditory tests has been developed that will be
tested on the system.
• Identify a single sound anomaly in the same room (Soundproof room with
low background noise)
• Identify a single sound anomaly in the same room (Soundproof room with
high background noise)
• Identify a single sound anomaly in the next room (Door open, soundproof
room with low background noise)
• Locating single sound anomaly at a low distance (2 meters) (Finding DOA in
a quiet room)
• Locating single sound anomaly at a medium distance (5 meters) (Finding
DOA in a quiet room)
• Locating single sound anomaly at a high distance (10 meters) (Finding DOA
in a quiet room)
7.3.2 Audio Testing Results

Anomaly Detection
The testing consisted of an anomaly detection portion and a DOA estimation portion. Anomaly detection was performed in a soundproof room, while the DOA estimation was done in a regular quiet room in order to test distances that the smaller soundproof room could not accommodate.
For anomaly detection, each test is done at a different sensitivity level or background noise level. The sensitivity level of the program is the amplitude threshold
that must be exceeded for a certain frequency to count as an anomaly. There are
three sensitivity levels: Low, Medium, and High. The Low sensitivity level is triggered when the amplitude at a certain frequency exceeds 300% of the amplitude
of the same frequency in the noise template, while the Medium sensitivity level is
triggered at 200%, and the High sensitivity level is triggered at 130%. The noise
templates’ noise levels vary from 25 decibels (Low), to 50-55 decibels when simulating a less quiet environment (High).
Each test consisted of 100 samples (samples are 2 seconds each) throughout which a
number of sound anomalies were introduced at random. The number of anomalies
produced and the amount detected by the robot were then compared, producing
the results seen in Table 7.3.
Background Noise Level        Sensitivity   TP   FP   FN   Precision   Recall
Low (25 decibels)             Low           22   0    7    100%        76%
Low (25 decibels)             Medium        19   4    0    83%         100%
Low (25 decibels)             High          29   8    0    78%         100%
High (50-55 decibels)         Low           17   4    2    81%         90%
High (50-55 decibels)         Medium        22   71   0    24%         100%
Low (Next room, open door)    Low           0    0    26   N/A         0%
Low (Next room, open door)    Medium        26   0    0    100%        100%
Low (Next room, open door)    High          24   23   0    51%         100%

Table 7.3: This table shows the results gathered in the anomaly detection testing
During the tests with high background noise, it was quickly established that anything above a low sensitivity would result in almost all samples being classified as anomalies. Therefore, no test with high sensitivity was performed.
Based on the anomaly detection test results, it was concluded that, in order to get the best results, it would be necessary to choose an appropriate sensitivity configuration depending on the environment. If the environment is quiet (25 decibels), the best results would be achieved at a sensitivity of medium or high. Choosing high for this environment will give the occasional false positive, but it also means it will give a very small number of false negatives. This should give the best opportunities to discover an intruder. In a more noisy environment, it is important to choose a low sensitivity to avoid too many false positives. The low sensitivity
does have the disadvantage of occasionally missing an actual anomaly, but picking up on all anomalies in a very noisy environment will be difficult without also detecting false positives.
If the sensitivity is appropriate, the precision of the program is around 80% with a recall close to 100% in an ideally quiet environment. In a noisy environment, the precision is around 80% with a recall of around 90%.
Direction of Arrival
The room chosen as the location for the DOA testing was a large lecture room
at Aalborg University. The size of the room allowed for the testing of the DOA
function at both 2 meters and 5 meters, which was not possible in the soundproof
room, however, it also meant that there was a noticeable echo when an anomaly
was produced. This, along with occasional noise coming from the outside of the
room, should be taken into consideration when analyzing the results of the tests.
With the room chosen, 9 different angles are measured out and marked in the room
with a distance of both 2 and 5 meters. At each of these 18 points, 20 anomalies
are produced, and the direction estimated by the DOA program is noted. Any
results containing errors were ignored and uncounted, as the suspicion was that
they stemmed from either the echo or the external noise.
                    2 meters                                          5 meters
Angle     Average     Minimum          Maximum           Average     Minimum          Maximum
(DEG)     deviation   deviation        deviation         deviation   deviation        deviation
0         6.75°       1°    (1°)       11°   (-11°)      8.45°       1°    (-1°)      20°   (-20°)
30        19.60°      5°    (35°)      21°   (51°)       23.15°      20°   (50°)      34°   (64°)
90        5.00°       5°    (95°)      5°    (95°)       5.95°       5°    (95°)      10°   (100°)
100       6.10°       0°    (100°)     12°   (112°)      12.00°      12°   (112°)     12°   (112°)
160       3.00°       3°    (157°)     3°    (163°)      7.00°       0°    (160°)     30°   (130°)
180       8.80°       8°    (172°)     9°    (-171°)     13.45°      8°    (172°)     18°   (-162°)
-65       9.10°       2°    (-67°)     32°   (-33°)      10.00°      2°    (-67°)     52°   (-13°)
-90       7.20°       6°    (-84°)     22°   (-112°)     20.00°      14°   (-104°)    22°   (-112°)
-130      13.80°      4°    (-134°)    32°   (-162°)     6.00°       6°    (-124°)    6°    (-124°)

Table 7.4: A table showing the recorded angles' deviation from the measured angle. In parentheses is the value of the recorded angle.
As seen in Table 7.4, the microphone array has more difficulty estimating the direction of a sound at some angles than it does at others. This is, for example,
a problem at the direction 30°, where the average deviation is 19.6° at 2 m, which increases to 23.15° at 5 m. These deviations are much higher than those measured at 160°, which are only 3° at 2 meters and 7° at 5 meters. This is also the case for the angle -90°, which has an average deviation of 20° at 5 m.
These deviations seem relatively high; however, when taking into consideration that the webcam being used has a field of view of 65°, deviations below 22.5° still allow the visual detection system to identify the anomaly. This number comes from dividing the FOV by 2, to get the maximum visual deviation in each direction, and subtracting 10° to ensure the movement is within line of sight. Deviations above this limit occurred only 20 times throughout all 360 tests, with 8 at 2 meters and 12 at 5 meters. The anomaly-producing object is thus within the requirement at a success rate of roughly 95.6% at 2 meters and 93.3% at 5 meters.
An increase in the average deviation can also be seen when going from 2 m to 5 m. This is the case for all angles tested except -130°; however, as the other angles point to a clear trend, this single result can be regarded as an outlier. With this information, it can be theorized that at a certain distance, the average deviation may increase to above 22.5°. This would mean that in the majority of cases, the deviation would be too high for the robot to reliably identify the anomaly using the visual detection that follows. With further testing, the maximum acceptable distance could be found.
Difficulties with the solution
A problem that the group ran into while performing these tests was the occurrence of NaN errors. NaN stands for "Not a Number" and became an issue for the detection program because the expected and measured time differences for t12 and t13 did not match. The theory as to why this became a prevalent issue during the auditory localization tests is that noise from external sources can interfere with the cross-correlation and thus the measured time differences. This reasoning explains
why there was an increase in the number of NaN errors under the localization
testing, as these tests needed to be performed in a larger test area without the
anechoic properties of the chamber used for the anomaly detection tests. The
introduction of noise from external sources can make the microphones register a
noise resulting in time differences that do not make sense. This issue had varying
frequency, with no occurrences for tests at some angles and over 60% for tests at other angles. After analyzing the results, no obvious link between the angle and the number of NaN errors can be established without further testing.
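The sketch below illustrates one way such NaN values can arise: under a far-field model, the direction is recovered through an arcsine of a quantity proportional to the measured delay, and noise-corrupted delays can push that argument outside the valid range [-1, 1]. The two-microphone geometry, the spacing, and the function name are assumptions for illustration, not the exact code of the detection program.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.10       # m, assumed spacing between two microphones

def doa_from_tdoa(tau: float) -> float:
    """Far-field direction estimate (degrees) from a single TDOA.

    If noise makes the measured delay larger than the physically possible
    maximum (spacing / speed of sound), the arcsine argument leaves
    [-1, 1] and NumPy returns NaN - the same symptom seen in the tests.
    """
    argument = SPEED_OF_SOUND * tau / MIC_SPACING
    return np.degrees(np.arcsin(argument))   # NaN when |argument| > 1

print(doa_from_tdoa(0.00015))   # plausible delay  -> roughly 31 degrees
print(doa_from_tdoa(0.00040))   # impossible delay -> nan
```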
Chapter 8
Discussion
This chapter first discusses errors and problems encountered during testing, along with ways to improve the current methods. Lastly, it evaluates the tests from chapter 7 against the requirements set in chapter 3.
8.1
Sources of error
Since this project works with sound and vision, there are many possible sources of
error. The discovered sources are listed below. The errors associated with sound
influence the TDOA calculation algorithm and the errors associated with light and
color influence the motion detection algorithm.
8.1.1
Sound-Associated Errors
When operating outside of an anechoic chamber, sound reflection will always be present to some degree. This means that a microphone may register a sound multiple times: once directly from the sound source and again reflected from surfaces in the environment. This may happen for all microphones or only some of them, and it can distort the calculated TDOAs and thus the calculated angle.
Additionally, the sampling rate of the sound card used is limited to 48 kHz. This means that small time differences are harder to resolve, because the measured delays are quantized to whole multiples of the sampling period (roughly 20.8 µs at 48 kHz).
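A back-of-the-envelope sketch of this quantization effect: at 48 kHz the delay can only be measured in steps of one sample, which corresponds to a coarse grid of possible angles under a far-field model. The microphone spacing used below is an assumption for illustration.

```python
import numpy as np

FS = 48_000              # Hz, sampling rate of the sound card
SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.10       # m, assumed microphone spacing

# One sample of delay corresponds to this path-length difference:
sample_period = 1.0 / FS                      # ~20.8 microseconds
path_step = SPEED_OF_SOUND * sample_period    # ~7.1 mm per sample

# The set of angles that an integer sample delay can represent (far field):
max_lag = int(MIC_SPACING / path_step)        # largest whole-sample delay
lags = np.arange(-max_lag, max_lag + 1)
angles = np.degrees(np.arcsin(lags * path_step / MIC_SPACING))
print(angles)   # note how coarse the angular grid becomes near +/-90 degrees
```

The same calculation also shows why a larger microphone spacing helps: with a wider array, one sample of delay corresponds to a smaller change in angle.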
Another problem that has been experienced is background noise. This severely decreases the performance of the program, since it has difficulty differentiating between sound anomalies and background noise.
8.1.2
Light and Color Associated Errors
Since the motion detection is performed using background subtraction with a constant difference threshold, the algorithm is naturally more sensitive when more light is present. This results in the program being worse at noticing movement in darker environments. Simply increasing the sensitivity of the program will make it more susceptible to noise, since noise occurs more frequently when filming dark environments.
Additionally, since background subtraction is used, the robot has difficulty detecting the movement of an object if the background behind it is of a similar color. This is especially true for darker colors, as these also make it harder to see shadows.
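A minimal sketch of this kind of fixed-threshold background subtraction, using OpenCV; the threshold value and blur size are illustrative assumptions rather than the project's actual parameters. Because the threshold is constant, the same physical movement produces smaller pixel differences in a dim scene, which is why detection degrades in the dark.

```python
import cv2

THRESHOLD = 25   # constant difference threshold (illustrative value)

def detect_motion(background_gray, frame_gray):
    """Return bounding boxes of regions that differ from the background."""
    diff = cv2.absdiff(background_gray, frame_gray)
    blurred = cv2.GaussianBlur(diff, (5, 5), 0)          # suppress pixel noise
    _, mask = cv2.threshold(blurred, THRESHOLD, 255, cv2.THRESH_BINARY)
    # OpenCV 4 returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```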
8.2
Areas of Improvement
8.2.1
Areas of Improvement in the Sound Analysis
There are multiple ways to improve the sound source localization. One way is to increase the sampling rate, so that the time differences are quantized less coarsely. The sampling rate cannot be increased indefinitely, as higher rates increase the computational load, and a higher rate alone will not remove the other sources of error.
Another possibility is to increase the distance between the microphones. This increases the TDOAs between the microphones, so the same sampling rate resolves the direction more finely; equivalently, the same angular resolution could be achieved with a lower sampling rate.
Lastly, the method used for sound source localization could be changed. There are many different types of cross-correlation, and only the most basic type was used in this project. It was not very robust to noise and often could not identify the sound source unless the source was very clear; this is elaborated on in Section 7.3.2. To achieve higher noise resistance, more complex cross-correlation methods could be implemented, such as generalized cross-correlation (GCC), which is a step up from the regular cross-correlation (CC). The regular CC method, in simple terms, takes two signals, performs FFTs on both of them so that they are in the frequency domain, finds the sliding dot product between them, and finally performs an inverse FFT on the sliding dot product to transform it back into the time domain. GCC adds an extra step of weighting the frequencies before they are transformed back into the time domain. There are a couple of different weighting functions to choose from: CC, ROTH, SCOT, and PHAT, where PHAT (PHAse Transform) is the only one usable in our case [8].
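A minimal sketch of GCC-PHAT, as a possible replacement for the plain cross-correlation: the cross-spectrum is normalized by its magnitude (the PHAT weighting) before the inverse FFT, which whitens the spectrum and sharpens the correlation peak [8]. This is an illustrative implementation under those assumptions, not the code used in the project.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` using GCC-PHAT.

    Returns the delay in seconds. `max_tau` can bound the search to
    physically possible delays (microphone spacing / speed of sound).
    """
    n = len(sig) + len(ref)                      # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross_spectrum = SIG * np.conj(REF)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross_spectrum /= np.abs(cross_spectrum) + 1e-15
    cc = np.fft.irfft(cross_spectrum, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so that zero lag sits in the middle of the searched window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```

In principle, the same function could be applied to each microphone pair to obtain t12 and t13 before the direction estimate, replacing the plain cross-correlation step.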
Another problem is the system's current inability to detect sound anomalies while driving, since the sound produced by the robot's own movement masks them. As a result, the robot is essentially unable to detect anything while moving.
8.2.2
Areas of Improvement in Motion Detection
The main area of improvement in the motion detection algorithm is the ability of the program to detect motion in darker environments. One possible way to do this could be to dynamically calibrate the sensitivity of the program depending on the light level, as sketched after this paragraph; this has, however, not been properly experimented with. Another approach could be to use a type of camera that is less impaired in low-light conditions. One example would be an infrared camera, which detects infrared light instead. It would, however, not be able to detect non-living agents (i.e. objects with a surface temperature similar to that of the environment). Another example of a camera more suited for low-light conditions is a depth-sensing camera, which would be capable of 3D-mapping the environment and would thus be better at discovering changes despite poor light conditions. Using such cameras would, however, come at an increased price compared to an RGB camera.
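One way the dynamic calibration mentioned above could look is to scale the difference threshold with the mean brightness of the frame, so that a dim scene uses a lower threshold. The function, its parameters, and the scaling constants below are illustrative assumptions and would need tuning.

```python
import numpy as np

def adaptive_threshold(frame_gray, base_threshold=25,
                       reference_brightness=128, minimum=8):
    """Scale the background-subtraction threshold with scene brightness.

    A frame darker than the reference lowers the threshold proportionally,
    so small pixel differences in dim scenes can still trigger detection;
    `minimum` keeps the threshold from collapsing into the noise floor.
    """
    brightness = float(np.mean(frame_gray))
    scaled = base_threshold * brightness / reference_brightness
    return max(minimum, int(round(scaled)))
```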
Another area of improvement is the system's current inability to track motion while the robot itself is moving. A way to improve this would be to include some movement compensation when running the motion detection program.
8.2.3
Miscellaneous Areas of Improvement
Another area of improvement is the movement controller. Currently, it is a simple proportional controller. This simplistic approach was chosen because the movement was not deemed an important part of this project, but it remains a clear area of improvement. One way of improving the movement controller would be to implement a full PID controller instead of just a P controller.
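A minimal discrete PID sketch of the kind of controller proposed here; the class, the gains, and the example error value are placeholders and would have to be tuned for the robot base.

```python
class PID:
    """Simple discrete PID controller (the current robot only uses the P term)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = 0.0

    def update(self, error, dt):
        """Return the control output for the current error and time step (dt > 0)."""
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: turning the robot toward an estimated sound direction.
controller = PID(kp=0.8, ki=0.05, kd=0.1)            # placeholder gains
angular_velocity = controller.update(error=15.0, dt=0.05)
```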
8.3
Requirement Fulfillment
To properly assess the capabilities of the prototype, the results of the multiple
rounds of testing can be compared to the requirements formulated at the start of
the project, seen in Table 4.1.
Requirement 3.1 sets a goal of 90% precision for the anomaly detection aspect of
the robot and Requirement 3.2 sets a goal of 90% recall for the same system.
• Using the most optimal noise template for a low background noise level (medium), a precision of 83% and a recall of 100% were achieved over 100 samples. Only requirement 3.2 was reached, although the result came close to requirement 3.1.
• Using the most optimal noise template for a high background noise level (low), a precision of 81% and a recall of 90% were achieved. Only requirement 3.2 was reached, although the result came close to requirement 3.1.
• Using the most optimal noise template for the next room with a low background noise level (medium), a precision of 100% and a recall of 100% were achieved. Requirements 3.1 and 3.2 were reached.
From this it can be gathered that the sound anomaly detection aspect does not meet the initial requirements, except when detecting sound in a different room with low background noise.
Requirement 3.3 sets the maximum acceptable deviation to 22.5° and expects a success rate of 90%.
• At a distance of 2 meters the success rate was 95.6%, meaning it reached
requirement 3.3.
• At a distance of 5 meters the success rate was 93.3%, meaning it reached
requirement 3.3.
From this, it can be gathered that the DOA aspect of the project has met the initially set requirement.
Requirement 3.4, sets a goal of using visual detection to identify a human
anomaly in a well-lit environment 95% of the time.
• At a distance of 1 meter, the tests yielded a 100% precision and 100% recall.
• At a distance of 3 meters, the tests yielded a 100% precision and 100% recall.
• At a distance of 5 meters, the tests yielded a 100% precision and 100% recall.
• At a distance of 10 meters, the tests yielded a 100% precision and 100% recall.
• At a distance of 20 meters, the tests yielded a 100% precision and 100% recall.
As 100% of tests in a well-lit environment were successfully able to identify the
anomaly presented, this requirement is determined to be fulfilled.
The final requirement, Requirement 3.5, sets a goal of using visual detection to
identify a human anomaly in a poorly-lit environment 80% of the time.
• At a distance of 1 meter, the tests yielded a 100% precision and 100% recall.
• At a distance of 3 meters, the tests yielded a 66.67% precision and 100% recall.
• At a distance of 5 meters, the tests yielded a 0% precision. The recall could
not be defined.
From this, it can be concluded that the visual detection algorithm fulfills the requirement at a distance of 1 meter, but not at distances of 3 meters and above. Additionally, the fact that only the precision is below expectations may indicate that the program is too insensitive.
Chapter 9
Conclusion
The aim of this project was to create a perception system to detect humans for a mobile robot platform. The project had the specific problem statement: "How can a perception system be made to detect humans for a mobile robotic platform?". From this problem statement, a concept design was made, and using these two as a stepping stone, a number of requirements were set, which can be found in Table 4.1. It can be concluded that the product documented in this report has fulfilled the requirements to the extent shown in Table 9.1.
No.  | System Requirement                              | Status
3.1  | Precision of Sound Anomaly Detection            | Partly fulfilled*
3.2  | Recall of Sound Anomaly Detection               | Fulfilled
3.3  | Sound Anomaly Identification                    | Fulfilled
3.4  | Human Identification in well-lit environment    | Fulfilled
3.5  | Human Identification in poorly-lit environment  | Partly fulfilled*

Table 9.1: Table showing the fulfillment status of the initially set system requirements.
*Partly fulfilled is to be understood as: the requirement has been fulfilled under certain, but not all, conditions.
It can be seen that three of the initial requirements have been fulfilled, and the other two are fulfilled under certain conditions. Specifically, it has been shown that the system made in this report is capable of detecting motion in a well-lit environment and is mostly capable of detecting sound anomalies in both quiet and noisy environments. It has also been shown that the system is mostly incapable of detecting motion in a poorly lit environment, only being capable of detecting motion at a distance of 1 meter from the target.
Additionally, possibilities for further work have been discussed. These include:
1. Finding a more accurate way to determine the DOA of a sound anomaly possibly using a different method such as generalized cross-correlation.
2. Finding a way for the motion detection system to work in non-well-lit environments - possibly using a different type of camera.
3. Finding a way for the motion detection system and the anomaly detection
system to function despite the movement of the mobile robot base, and the
sound produced by the motors.
4. Implementing a better movement controller, in order for the robot to move in
a smoother manner.
With this further work, it is assumed that the project could reach a higher level of
accuracy when detecting anomalies, while also being generally more functional.
Bibliography
[1] url: https://www.amazon.com/Logitech-960-000045-720p-Webcam-C905/dp/B000RZNI4S.
[2] 3Blue1Brown. But what is the Fourier Transform? A visual introduction.
[3] About nimbo. url: http://hellonimbo.com/about/.
[4] S. Adrián-Martínez et al. Acoustic signal detection through the cross-correlation method in experiments with different signal to noise ratio and reverberation conditions. 2015. doi: 10.48550/ARXIV.1502.05038. url: https://arxiv.org/abs/1502.05038.
[5] Edge AI + Vision Alliance. Camera Selection – How Can I Find the Right Camera for My Image Processing System? url: https://www.edge-ai-vision.com/2019/03/camera-selection-how-can-i-find-the-right-camera-for-my-image-processing-system/.
[6] nti audio. Fast Fourier Transformation FFT - Basics.
[7] Behringer ECM8000 målings mikrofon.
[8] Lin Chen et al. "Acoustic Source Localization Based on Generalized Cross-correlation Time-delay Estimation". In: Procedia Engineering 15 (2011). CEIS 2011, pp. 4912–4919. issn: 1877-7058. doi: 10.1016/j.proeng.2011.08.915. url: https://www.sciencedirect.com/science/article/pii/S1877705811024167.
[9] Prof. Efstathiou Constantinos.
[10] Donatello Conte et al. "An Ensemble of Rejecting Classifiers for Anomaly Detection of Audio Events". In: 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance. 2012, pp. 76–81. doi: 10.1109/AVSS.2012.9.
[11] Cyrus Farivar. Security robots expand across U.S., with few tangible results. 2021. url: https://www.nbcnews.com/business/business-news/security-robots-expand-across-u-s-few-tangible-results-n1272421.
[12] Felix Asche, Basler AG. Camera selection for low-light imaging. 2021. url: https://www.photonics.com/Articles/Camera_Selection_for_Low-Light_Imaging/a66942.
[13] Alison Fields, Steven Linnville, and Robert Hoyt. "Correlation of objectively measured light exposure and serum vitamin D in men aged over 60 years". In: Health Psychology Open 3 (May 2016). doi: 10.1177/2055102916648679.
[14] Emilie Foxil. Så meget videoovervågning er der i Danmark. 2019. url: https://nyheder.tv2.dk/politik/2019-10-09-saa-meget-videoovervaagning-er-der-i-danmark.
[15] Global Patrol Robot Market Size and forecast. url: https://www.marketresearchintellect.com/product/global-patrol-robot-market-size-and-forecast/?utm_source=Designerwomen&utm_medium=127.
[16] Aman Preet Gulati. Vehicle motion detection using background subtraction. 2022. url: https://www.analyticsvidhya.com/blog/2022/03/vehicle-motion-detection-using-background-subtraction/.
[17] Yi Guo and Zhihua Qu. "Coverage control for a mobile robot patrolling a dynamic and uncertain environment". In: Fifth World Congress on Intelligent Control and Automation (IEEE Cat. No.04EX788). Vol. 6. 2004, pp. 4899–4903. doi: 10.1109/WCICA.2004.1343643.
[18] Hugh D. Young and Roger A. Freedman. University Physics - With Modern Physics. Fifteenth edition with SI units. Pearson, 2020, pp. 1213–1221.
[19] Stemmer Imaging. Colour Cameras. url: https://www.stemmer-imaging.com/en/knowledge-base/colour-cameras/.
[20] Indbrud I Fire Ud af ti bygge- og anlægsvirksomheder. 2019. url: https://via.ritzau.dk/pressemeddelelse/indbrud-i-fire-ud-af-ti-bygge--og-anlaegsvirksomheder?publisherId=12604233&releaseId=13575848.
[21] Isabell Bang Christensen and Iben Peders. Antallet af anmeldte indbrud falder fortsat. url: https://www.dst.dk/da/Statistik/nyheder-analyser-publ/nyt/NytHtml?cid=33206.
[22] Ui-Hyun Kim, Kazuhiro Nakadai, and Hiroshi G. Okuno. Improved sound source localization in horizontal plane for binaural robot audition - Applied Intelligence. 2014. url: https://link.springer.com/article/10.1007/s10489-014-0544-y.
[23] Koshsh. 2021. url: https://smpsecurityrobot.com/products/robot-thermal-camera/.
[24] Kristian Nymoen, University of Oslo. Quantitative Sound Analysis and the Visual Representations of Sound.
[25] Mike Levine. The shape of things to come: Different types of microphones and when to use them. 2022. url: https://www.popsci.com/reviews/types-of-microphones/.
[26] Song Li and Jürgen Peissig. "Measurement of Head-Related Transfer Functions: A Review". In: Applied Sciences 10.14 (2020). issn: 2076-3417. doi: 10.3390/app10145014. url: https://www.mdpi.com/2076-3417/10/14/5014.
[27] Wenqi Li, Dehua Chen, and Jiajin Le. "Robot Patrol Path Planning Based on Combined Deep Reinforcement Learning". In: 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). 2018, pp. 659–666. doi: 10.1109/BDCloud.2018.00101.
[28] Hanhe Lin et al. "Online weighted clustering for real-time abnormal event detection in video surveillance". In: Proceedings of the 24th ACM International Conference on Multimedia. 2016, pp. 536–540.
[29] Hong Liu and Miao Shen. "Continuous sound source localization based on microphone array for mobile robots". In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2010, pp. 4332–4339. doi: 10.1109/IROS.2010.5650170.
[30] Thomas B. Moeslund. Introduction to Video and Image Processing: Building Real Systems and Applications. Springer London, 2012.
[31] Geoffrey Morrison. Fisheye Or Wide-Angle Lens For Travel. 2022.
[32] Amna Rahman, Zakria Qadir, Abbas Z. Kouzani, Muhammad Usman Liaquat, Hafiz Suliman Munawar, and M. A. Parvez Mahmud. "Sound Localization for Ad-Hoc Microphone Arrays". In: Energies (2021). url: https://www.mdpi.com/1996-1073/14/12/3446/pdf.
[33] Guilherme Gaigher Netto. Optical flow and motion detection. 2019. url: https://medium.com/@ggaighernt/optical-flow-and-motion-detection-5154c6ba4419.
[34] Open Source Robotics Foundation, Inc. TurtleBot2. url: https://www.turtlebot.com/turtlebot2/.
[35] Ata-Ur Rehman et al. "Multi-Modal Anomaly Detection by Using Audio and Visual Cues". In: IEEE Access 9 (2021), pp. 30587–30603. doi: 10.1109/ACCESS.2021.3059519.
[36] Ashutosh Saxena and Andrew Y. Ng. "Learning Sound Location from a Single Microphone". In: (2009). url: https://cs.stanford.edu/people/asaxena/monaural/monaural.pdf.
[37] Ashutosh Saxena and Andrew Y. Ng. Learning Sound Location from a Single Microphone - Stanford University. 2009. url: https://cs.stanford.edu/people/asaxena/monaural/monaural.pdf.
[38] Kamal Sehairi, Fatima Chouireb, and Jean Meunier. "Comparative study of motion detection methods for video surveillance systems". In: Journal of Electronic Imaging 26.2 (2017), p. 023025. doi: 10.1117/1.jei.26.2.023025. url: https://doi.org/10.1117%2F1.jei.26.2.023025.
[39] Serviceforbundet: Peter Jørgensen, DI overenskomst: Annette Fæster Petersen, Vagt- og Sikkerhedsfunktionærernes Landssammenslutning: Robet F. Andersen. DI - Danmarks største arbejdsgiver- og erhvervsorganisation - dansk ... 2020. url: https://www.danskindustri.dk/DownloadDocument?id=161749&docid=64162.
[40] Sound Fields: Free versus Diffuse Field, Near versus Far Field. 2020. url: https://community.sw.siemens.com/s/article/sound-fields-free-versus-diffuse-field-near-versus-far-field.
[41] Danmarks Statistik. Indbrud i forretning, virksomhed mv. Q1-2022 Q2-2022 [Hele Landet]. url: https://www.statistikbanken.dk/straf11.
[42] Lasse Nikolaj Staun. Så mange indbrud bliver der begået i Danmark om året. 2020. url: https://dkr.dk/indbrud/indbrud-i-tal.
[43] Stacy Stephens. K5. 2021. url: https://www.knightscope.com/k5/.
[44] Adobe Stock. 360 Bilder – Bläddra Bland 117,484 Stockfoton, Vektorer Och Videor. 2022.
[45] T-Studio. ZOOM F6 Field Recorder.
[46] Temporal Representations: Time-Measuring Circuits. url: https://doctorlib.info/physiology/medical/90.html.
[47] Kobuki TurtleBot. iClebo Kobuki Robot Base. url: http://kobuki.yujinrobot.com/about2/.
[48] Vocal Technologies. url: https://vocal.com/echo-cancellation/spatial-sampling-and-aliasing-with-microphone-array/.
[49] Why ROS? 2022. url: https://www.ros.org/blog/why-ros/.
Appendix A
RQT_graph