sensors
Article
Fast Underwater Optical Beacon Finding and High Accuracy
Visual Ranging Method Based on Deep Learning
Bo Zhang 1 , Ping Zhong 1, * , Fu Yang 1 , Tianhua Zhou 2 and Lingfei Shen 2
1 College of Science, Donghua University, Shanghai 201620, China
2 Key Laboratory of Space Laser Communication and Detection Technology, Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai 201800, China
* Correspondence: pzhong937@dhu.edu.cn; Tel.: +86-137-6133-6705
Abstract: Visual recognition and localization of underwater optical beacons is an important step in
autonomous underwater vehicle (AUV) docking. The main issues that restrict the use of underwater
monocular vision range are the attenuation of light in water, the mirror image between the water
surface and the light source, and the small size of the optical beacon. In this study, a fast monocular
camera localization method for small 4-light beacons is proposed. A YOLO V5 (You Only Look Once)
model with a coordinate attention (CA) mechanism is constructed. Compared with the original
model and the model with the convolutional block attention mechanism (CBAM), our model
improves the prediction accuracy to 96.1% and the recall to 95.1%. A sub-pixel light source centroid
localization method combining super-resolution generative adversarial networks (SRGAN) image
enhancement and Zernike moments is proposed. The detection range of small optical beacons is
increased from 7 m to 10 m. In the laboratory self-made pool and anechoic pool experiments, the
average relative distance error of our method is 1.04 percent, and the average detection speed is
0.088 s (11.36 FPS). This study offers a solution for the long-distance, fast, and accurate positioning of
small underwater optical beacons thanks to its fast recognition, accurate ranging, and wide detection
range.
Keywords: autonomous underwater vehicles; target detection; monocular vision; deep learning
Citation: Zhang, B.; Zhong, P.; Yang, F.; Zhou, T.; Shen, L. Fast Underwater Optical Beacon Finding and High Accuracy Visual Ranging Method Based on Deep Learning. Sensors 2022, 22, 7940. https://doi.org/10.3390/s22207940
Academic Editor: Enrico Meli
Received: 22 September 2022
Accepted: 17 October 2022
Published: 18 October 2022
1. Introduction
Remotely Operated Vehicles (ROV) and Autonomous Underwater Vehicles (AUV) are
the two main forms of Unmanned Underwater Vehicles (UUV), which are crucial tools for
people exploring the deep sea [1–4]. Because of the restriction on cable length, ROVs can
only operate within a certain range and are dependent on the control platform and operator,
which limits their flexibility and prevents them from meeting the requirements for military
operations regarding concealment. AUVs overcome these shortcomings: because they are free of
cables and operating platforms, they offer better flexibility and concealment, and AUV technology
is gradually becoming the focus of national marine research [5,6]. However, AUVs are limited by
the electromagnetic shielding of the water column, and communication technologies developed for
use on land are difficult to adapt to the communication needs of the entire water column [7]. In
order to exchange information
and charge its batteries, the AUV must return frequently to dock with the supply platform.
Typically, the current AUV return solution uses an inertial navigation system (INS) for
positioning over long distances, an ultra-short base-line positioning system at a distance
of about one kilometer from the docking platform, and a multi-sensor fusion navigation
method to approach the docking interface [8–10]. When the AUVs are more than 10 m away
from the docking interface, the aforementioned technique struggles to meet the demands of
precise positioning and docking [11]. Installing identification markers, such as light sources
on the AUV and the docking interface, is the current standard procedure for increasing
docking accuracy. Additionally, machine vision technology is used in the underwater
environment to gather the necessary direction and distance information by sensing the
characteristics of the identification markers [12–14].
Using traditional machine vision and image processing methods, Lijia Zhong et al.
proposed a binocular vision localization method for AUV docking, using an adaptive
weighted OTSU threshold segmentation method to accurately extract foreground targets.
The method achieves an average position error of 5 cm and an average relative error of
2% within a range of 3.6 m [15]. This technique uses binocular vision to increase detection
accuracy. However, doing so increases the size of the equipment, which raises costs, narrows
the method's potential application areas, and increases the pressure burden on the vehicle in deep water.
In recent years, the underwater positioning method of AUVs based on deep learning
has also been widely developed. Shuang Liu et al., following the idea of MobileNet, built a
Docking Neural Network (DoNN) with a convolutional neural network, realized target extraction
of the docking interface through a monocular camera, and combined it with the RPnP algorithm
to estimate the pose of an 8-LED docking interface with a size of 2408 mm. The detection range of the docking interface
is increased to 6.5 m, and under the condition of strong noise, the average detection error
of the docking distance is only 9.432 mm, and the average detection error of the angle is
2.353 degrees [16]. Ranzhen Ren et al. used the combination of YOLO v3 and the P4P
algorithm to locate an optical beacon with a length of 28 cm at a long distance of 3 m to
15 m and used Aruco markers for positioning within a range of 3 m, realizing the visual
docking of the combination of far and near [17].
The underwater ranging algorithm combined with deep learning has been proven to
have the advantages of better real-time performance and higher detection accuracy [17–19].
However, the following problems need to be solved:
1.	The water pressure increases as the AUV's navigation depth and volume increase. Therefore, it is necessary to reduce the size of the AUV and improve the ranging accuracy of the monocular camera. This places a high demand on the long-distance identification of small optical beacons, which existing target detection algorithms struggle to meet;
2.	With the increase in AUV working distance and the decrease in the size of the optical beacons, the light source features of the optical beacon occupy only a few pixels, which makes it very difficult to locate the centroid pixel coordinates of the light sources;
3.	The traditional Perspective-n-Point (PnP) algorithm for optical beacon attitude calculation has low accuracy for long-distance target poses.
In order to solve the above problems, the main contributions and core innovations of
this paper are as follows:
•	In order to solve the problem of long-distance target recognition of small optical beacons, YOLO V5 is used as the backbone network [20], the Coordinate Attention (CA) and Convolutional Block Attention Module (CBAM) mechanisms are added for comparison [21,22], and training is performed on a self-made underwater optical beacon dataset. It is proved that the YOLO V5 model with CA has good detection accuracy for small optical beacons when the network depth is relatively shallow, and it solves the problem of extracting small optical beacons at 10 m underwater;
•	In order to solve the problem of obtaining pixel coordinates when the feature points of small optical beacons occupy only a few pixels, a Super-Resolution Generative Adversarial Network (SRGAN) is introduced into the detection process [23]. Then, the sub-pixel coordinates of the light source centroids are obtained through adaptive threshold segmentation (OTSU) and a sub-pixel centroid extraction algorithm based on Zernike moments [24,25]. It is proved that the combination of super-resolution and sub-pixel localization has a good effect on locating the pixel coordinates of the target light sources in the case of 4× upscaling reconstruction of the image;
•	In order to solve the problem of inaccurate calculation of the pose of small optical beacons, a simple and robust perspective-n-point algorithm (SRPnP) is used as the pose solution method, and it is compared with the non-iterative O(n) solution of the PnP problem (OPnP) and one of the best iterative methods, which is globally convergent in the ordinary case (LHM) [26–28].
Our method adds an attention mechanism to the classical neural network model YOLO V5, migrates the detection algorithm applied on land to the water environment, and improves the precision and recall of the model. The super-resolution technology is innovatively used as an image enhancement method for small optical beacons and is combined with the sub-pixel centroid positioning method to improve the range and accuracy of light source centroid positioning. Through experimental verification, the distance calculation error of the proposed method for four-light-source small optical beacons with a size of 88 mm × 88 mm is 1.04%, and the average detection speed is 0.088 s (11.36 FPS). Our method is superior to existing methods in detection accuracy, detection speed, and detection range and provides a feasible and effective method for monocular visual ranging of underwater optical beacons.
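To make the ranging step concrete, the following minimal sketch shows how one pose-and-distance estimate can be computed once the four light-source centroids have been extracted from an image. It is an illustrative sketch only: the beacon geometry follows Section 2 (88 mm adjacent spacing, 125 mm diagonal), OpenCV's generic iterative PnP solver is used as a stand-in for SRPnP (which OpenCV does not implement), and the camera intrinsics and centroid values are hypothetical placeholders.

```python
import numpy as np
import cv2

# 3D coordinates of the four beacon lights in the beacon frame (metres).
# Cross layout from Section 2: adjacent lights 88 mm apart, diagonal 125 mm.
object_points = np.array([
    [ 0.0000,  0.0625, 0.0],   # top light
    [ 0.0625,  0.0000, 0.0],   # right light
    [ 0.0000, -0.0625, 0.0],   # bottom light
    [-0.0625,  0.0000, 0.0],   # left light
], dtype=np.float64)

# Sub-pixel centroids of the same lights in the image (placeholder values).
image_points = np.array([
    [642.3, 310.8],
    [668.1, 336.5],
    [642.4, 362.2],
    [616.6, 336.4],
], dtype=np.float64)

# Hypothetical pinhole intrinsics of the calibrated CCD.
K = np.array([[1200.0, 0.0, 640.0],
              [0.0, 1200.0, 360.0],
              [0.0,    0.0,   1.0]])
dist_coeffs = np.zeros(5)  # assume lens distortion is already corrected

# Iterative PnP as a stand-in for SRPnP: recovers rotation and translation.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist_coeffs,
                              flags=cv2.SOLVEPNP_ITERATIVE)
if ok:
    distance = float(np.linalg.norm(tvec))  # camera-to-beacon range in metres
    print(f"estimated range: {distance:.3f} m")
```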
2. Experimental Equipment and Testing Devices
For the experiments, underwater optical beacon images between 1 and 10 m deep were gathered. There are two different categories of underwater optical beacons (Figure 1): cross beacons made of four light-emitting diodes (LEDs) and cross optical communication probes made of four laser diodes (LDs). The wavelengths of the two optical beacon light sources are 520 nm and 450 nm because these two wavelengths have the least propagation attenuation in seawater [29]. The distance between two adjacent light sources is 88 mm, and the diagonal light source spacing is 125 mm.
Figure 1. Optical beacons: (a) A cross-light beacon composed of three 520 nm LEDs and one 450 nm LED; (b) an optical communication probe composed of four 520 nm laser diodes.
A Sony 322 camera, mounted in a waterproof compartment together with an optical communication probe, was used to take pictures for underwater experiments at ranges of up to 10 m in an anechoic pool, as shown in Figure 2. A Canon D60 was used in a temporary lab pool to capture high-resolution optical beacon images at a range of 3 m, because a super-resolution dataset was needed.
Figure 2. Underwater image acquisition device combined with optical communication probe.
In the underwater ranging experiment, the detection probe and the probe to be measured are fixed on two high-precision lifting and slewing devices on two precision test platforms. The lifting rod can move in three dimensions and have its end rotate 360 degrees at the same time (Figure 3). The accuracy of the three-dimensional movement of the translation stage is 1 mm, and the rotation error is 0.1 degree. In the experiment, two rotating lifting rods control the devices so that they descend 3 m below the water surface and keep the depth constant.
Figure 3. Device installation diagram: (a) The detection device is fixed to the lift bar and drops to 3 m underwater; (b) The device to be tested is fixed to a lift rod on the other side and lowered to 3 m underwater.
The devices only change the rotation angle and the relative position within the same plane. Figure 4 is a top view of the experimental conditions for image acquisition in this paper. The experiment was carried out in an anechoic tank with a width of 15 m and a length of 30 m. The device equipped with the camera is kept stationary, the device to be tested is fixed on the mobile platform and moves along the z-axis, and only fine-tuning in the x-axis direction is applied so that the optical beacon features appear in the CCD field of view. Initially, the two platforms were separated by 10 m, and images were collected every 1 m.
Figure 4. The condition of the experiment (top view).
3. Underwater Optical Beacon Target Detection and Light Source Centroid Location Method
Figure 5 depicts our method's overall workflow: the algorithm's input is the underwater optical beacon image captured by the calibrated CCD. The target is detected using YOLO V5 CA, and the target image is then fed into SRGAN for 4× super-resolution enhancement. After the sub-pixel coordinates of the feature light sources are obtained using adaptive OTSU threshold segmentation and Zernike sub-pixel edge detection, scale recovery is used to obtain the image coordinates of the feature points within the original image size. The SRPnP algorithm is then used to determine the target's pose and distance.
Figure 5. Flow chart of visual positioning algorithm.
3.1. Underwater Target Detection Method Based on YOLO V5
It is very challenging to locate and extract underwater targets because of the complexity of the underwater environment, including the light deflection caused by water flow, the reflections formed by the light at the water surface and the waterproof chamber, and the environmental disturbances. YOLO V5 (You Only Look Once), as a fast and accurate target recognition neural network, is used in the target detection stage.
YOLO V5 follows the main structure of regression and classification of YOLO V4 [30]. However, it also includes the most recent techniques, including CIoUloss, adaptive
anchor box calculation, and adaptive image scaling, to help the network converge more
quickly and detect targets with greater accuracy during training. And the extraction of
overlapping targets is also better than that of YOLO V4. The most important thing is that
it reduces the size of the model to 1/4 of the original model, which meets the real-time
detection requirements of underwater equipment in terms of detection speed. The loss
function used by YOLO V5 for training is
$$
\begin{aligned}
Loss = \; & CIoU_{loss} + \sum_{i=0}^{S^2}\sum_{j=0}^{B_n} I_{ij}^{obj}\Big[\hat{C}_i \log (C_i) + \big(1-\hat{C}_i\big)\log \big(1-C_i\big)\Big] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B_n} I_{ij}^{noobj}\Big[\hat{C}_i \log (C_i) + \big(1-\hat{C}_i\big)\log \big(1-C_i\big)\Big] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B_n} I_{ij}^{obj} \sum_{c \in classes}\Big[\hat{p}_i^{\,j}(c)\log \big(p_i^{\,j}(c)\big) + \big(1-\hat{p}_i^{\,j}(c)\big)\log \big(1-p_i^{\,j}(c)\big)\Big]
\end{aligned} \tag{1}
$$
In Formula (1), S is the number of grid cells into which the input image is divided, Bn
is the number of anchor boxes, CIoUloss is the loss of the bounding box. The second and
third terms are confidence loss, which means the confidence of the bounding box containing
the object, $I_{ij}^{obj}$, and the confidence of the bounding box without an object, $I_{ij}^{noobj}$. The fourth
term is the cross-entropy loss. When the jth anchor frame of the ith grid is responsible
for the prediction of a real target, the bounding box generated by this anchor frame is
only involved in the calculation of the classification loss function. Assuming that A is the
prediction box and B is the real box, let C be the minimum convex closed box containing A
and B, then the intersection over union (IoU) of the real box and the prediction box and the
loss function CIoUloss are calculated as follows:
$$
IoU = \frac{|A \cap B|}{|A \cup B|}, \tag{2}
$$
$$
CIoU = IoU - \frac{\mathit{Distance\_Center}^{2}}{\mathit{Distance\_Corner}^{2}} - \frac{v^{2}}{(1 - IoU) + v}, \tag{3}
$$
where
$$
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w}{h} - \arctan\frac{\hat{w}}{\hat{h}}\right)^{2}, \qquad CIoU_{loss} = 1 - CIoU.
$$
In Formula (3), Distance_Center is the Euclidean distance between the center points of
box A and box B, Distance_Corner is the diagonal length of the minimum closed box C, and v is a
parameter that measures the consistency of the aspect ratio between the predicted box and the real box.
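As a quick numerical check of Formulas (2) and (3), the following sketch computes IoU, the aspect-ratio term v, and the resulting CIoU loss for a pair of axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are hypothetical.

```python
import math

def iou(a, b):
    # a, b: (x1, y1, x2, y2) axis-aligned boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)                       # Formula (2)

def ciou_loss(pred, gt):
    i = iou(pred, gt)
    # Squared distance between the two box centers.
    cxp, cyp = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cxg, cyg = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    dist_center2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    # Squared diagonal of the minimum closed box C containing both boxes.
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    dist_corner2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio consistency term v.
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(w / h) - math.atan(wg / hg)) ** 2
    ciou = i - dist_center2 / dist_corner2 - v ** 2 / ((1 - i) + v)  # Formula (3)
    return 1 - ciou                                                  # CIoU loss

print(ciou_loss((10, 10, 60, 50), (12, 14, 64, 52)))  # hypothetical boxes
```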
3.2. YOLO V5 with Attention Modules
Fast object detection and classification are advantages of the YOLO V5n model, but
this is due to a reduction in the number and width of network layers, which also has the
drawback of lowering detection accuracy. The application of attention mechanisms in
machine vision enables the network model to ignore irrelevant information and focus on
important information. The attention mechanisms can be divided into spatial domains,
channel domains, mixed domains, etc. Therefore, this paper adds the CBAM attention
mechanism and the coordinate attention mechanism to the YOLO V5n model, trains it on
the self-made underwater light beacon dataset, and compares the validation set loss, recall
rate, and mean average precision (mAP). It is proven that the CA attention mechanism has
a good improvement over the lightweight YOLO V5 model and realizes the long-distance
accurate identification and detection of small optical beacons in the water environment.
As seen in Figure 6a, the Convolutional Block Attention Module (CBAM) is a module
that combines spatial and channel attention mechanisms. The input feature image first
passes through a residual module, and then performs global maximum pooling (GMP)
and global average pooling (GAP) to obtain two one-dimensional feature vectors, which are then subjected to convolution, ReLU activation, and a 1 × 1 convolution, and are weighted with the input features after normalization using Batch Normalization (BN) and Sigmoid to obtain the channel attention feature maps. The channel feature map then performs GAP and GMP over all channels at each pixel position to obtain two feature maps, and after a 7 × 7 convolution of the two feature maps, normalization is performed to obtain the attention feature that combines channel and space.
The channel feature map contains the local area information of the original image after several convolutions. The global feature information of the original image cannot be obtained because only the local information is taken into account when the maximum and average values of the multi-channels at each position are used as weighting operations. This issue is effectively resolved by the coordinate attention mechanism, as seen in Figure 6b: the feature map passing through the residual module uses pooling kernels of summation along the horizontal coordinate (H, 1) and the vertical coordinate (1, W), respectively. Each channel is encoded to obtain features in two spatial orientations. This method allows the network to obtain the feature information in one spatial direction while preserving the positional information of the other spatial direction, which helps the network to locate the target of interest more accurately.
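To make the coordinate attention mechanism concrete, the following is a minimal PyTorch-style sketch of such a block, written from the description above; the reduction ratio, activation choice, and layer sizes are illustrative assumptions rather than the exact configuration used in our YOLO V5_CA model.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate attention block: pool along H and W separately,
    encode, then re-weight the input with direction-aware attention maps."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                            # N x C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # N x C x W x 1
        y = torch.cat([x_h, x_w], dim=2)                # concat along spatial dim
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)                   # back to N x mid x 1 x W
        a_h = torch.sigmoid(self.conv_h(y_h))           # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w))           # attention along width
        return x * a_h * a_w                            # re-weight the input

# quick shape check
out = CoordinateAttention(64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```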
Figure 6. Diagram of attentional mechanism: (a) Convolutional block attention mechanism; (b) Coordinate attention mechanism.
channel feature
map contains
theYOLO
local area
in theattention
original image
AThe
schematic
representation
of the
V5 information
structure with
modules is
after
several
convolutions.
The
global
feature
information
of
the
original
image
be
shown in Figure 7. Two structures were created to fit the CBAM and CA cannot
characteristics.
obtained because only the local information is taken into account when the maximum and
Four C3 convolution modules with different depths in the backbone network are replaced
average values of the multi-channels at each position are used as weighting operations. This
byissue
CBAM
with the same input and output to address the issue that multiple convolutions
is effectively resolved by the coordinate attention mechanism, as seen in Figure 6b:
cause
CBAM
to passing
lose local
information.
This
allows
CBAM
to obtain
weights
for channel
the feature
map
through
the residual
module
uses
the pooling
kernel
of summation
and
spatial
feature
maps
at
various
depths.
Due
to
the
lightweight
of
the
CA
modules,
along the horizontal coordinate (H, 1) and vertical coordinate (1, W) pairs respectively.
they
arechannel
addedistoencoded
the positions
of the
thefeatures
4-feature
mapspatial
fusion,
so that theThis
model
can better
Each
to obtain
of two
orientations.
method
allows
theweight
network
obtain the
featuremaps.
information in one spatial direction and save the
obtain
the
oftodifferent
feature
other spatial position information, which helps the network to locate the target of interest
more accurately.
A schematic representation of the YOLO V5 structure with attention modules is shown
in Figure 7. Two structures were created to fit the CBAM and CA characteristics. Four
C3 convolution modules with different depths in the backbone network are replaced by
CBAM with the same input and output to address the issue that multiple convolutions
cause CBAM to lose local information. This allows CBAM to obtain weights for channel
and spatial feature maps at various depths. Due to the lightweight of the CA modules, they
are added to the positions of the 4-feature map fusion, so that the model can better obtain
the weight of different feature maps.
Sensors
2022,
22,
Sensors
Sensors2022,
2022,22,
22,x7940
xFOR
FORPEER
PEERREVIEW
REVIEW
of21
21
88 8of
of
21
CBAM
CBAM
CBAM
CBAM
CBAM
CBAM
CBS
CBS C3
CBS CBS
CBS C3
C3 CBS
C3 CBS
CBS C3
C3
CBAM
CBAM
Convolutional
ConvolutionalBlocks
BlocksAttention
AttentionModule
Module
CBS
CBS C3
SPPF CBS
CBS
C3 SPPF
Upsample
Upsample
CA
CA
RGB
RGBimage
imageinput
input
Concat
Concat C3
C3 CBS
CBS
Upsample
Upsample
CA
CA
CBS
CBS ==
Conv
SiLU
Conv BN
BN SiLU
C3
C3 ==
CBS
CBS
SPPF
SPPF ==
CBS
CBS
Res
Res
Res
Res ==
Concat
Concat
CBS
CBS
CBS
CBS CBS
CBS
Concat
Concat C3
C3
CA
CA
MaxPool
MaxPool
CBS
CBS
CBS
CBS
CBS
CBS
Concat
Concat C3
C3
CA
CA
MaxPool
MaxPool
CBS
CBS
CBS
CBS
Coordinate
Coordinate
Attention
Attention
Module
Module
add
add
Concat
Concat C3
C3
MaxPool
MaxPool
Concat
Concat
CBS
CBS
CBS
CBS
output
output
Figure
Figure
7.YOLO
YOLO
v5
structure
diagramwith
withattention
attentionmodules.
modules.
Figure7.
7.
YOLOv5
v5structure
structurediagram
diagram
with
attention
modules.
In
In
Figure
8,
the target
target is
is the
the object
object picked
picked by
by the
the red
red boxes,
boxes, and
and the
the mirror
mirror image
image of
of
InFigure
Figure8,
8, the
the
target
is
the
object
picked
by
the
red
boxes,
and
the
mirror
image
of
the
target
and
the
water’s
surface,
as
well
as
the
laser
point
for
optical
communication,
are
the
well
asas
thethe
laser
point
for for
optical
communication,
are
the target
target and
andthe
thewater’s
water’ssurface,
surface,asas
well
laser
point
optical
communication,
selected
by
boxes.
Because
the
chosen
by
box
during
image
acare selected
by blue
the
boxes.
Because
the
objects
chosen
by blue
the
during
image
selected
by the
the
blueblue
boxes.
Because
the objects
objects
chosen
by the
the
blueblue
boxbox
during
image
acquisition
are
noise
and
will
with
of
labels
are
divided
acquisition
noise
will
interfere
with
the extraction
of the the
target,
the
labels
are
quisition
areare
noise
andand
willinterfere
interfere
withthe
theextraction
extraction
ofthe
thetarget,
target,
the
labels
are
divided
into
target
and
noise.
This
makes
ititnecessary
to
between
and
divided
into
target
and
noise.
This
makes it necessary
to distinguish
between
noise feaand
into
target
and
noise.
This
makes
necessary
todistinguish
distinguish
betweennoise
noise
andtarget
target
features
using
neural
network
training.
7606
images
of
LED
and
LD
optical
beacons
with
target
features
using
neural
network
training.
7606
images
of
LED
and
LD
optical
beacons
tures using neural network training. 7606 images of LED and LD optical beacons with
various
attitudes
were
in
underwater
environment
ranging
from
m
with
various
attitudes
were
collected
the underwater
environment
ranging
m
various
attitudes
were collected
collected
in the
thein
underwater
environment
ranging
from 11from
m to
to110
10
to 10
of which
5895
training
set data,
were
set
data.
Three
models
m,
of
which
5895
training
set
and
1711
were
test
set
Three
models
of
m,
of m,
which
5895 were
werewere
training
set data,
data,
and and
17111711
were
testtest
set data.
data.
Three
models
of
of YOLO
V5n,
YOLO
V5_CBAM,
YOLO
V5_CA
are trained
separately
on3090
RTX with
3090
YOLO
V5n,
YOLO
V5_CBAM,
and
YOLO
V5_CA
are
separately
on
YOLO
V5n,
YOLO
V5_CBAM,
andand
YOLO
V5_CA
aretrained
trained
separately
onRTX
RTX
3090
with
a 128-batch
size500
and
500 epochs.
To prevent
overfitting
of
the model,
anstop
early
stop
awith
size
epochs.
To
overfitting
of
an
mecha 128-batch
128-batch
sizeand
and
500
epochs.
Toprevent
prevent
overfitting
of the
themodel,
model,
anearly
early
stop
mechmechanism
was
introduced
during
the
training
where
YOLO
V5n
stopped
at
496,
and
anism
was
introduced
during
the
training
where
YOLO
V5n
stopped
at
496,
and
YOLO
anism was introduced during the training where YOLO V5n stopped at 496, and YOLO
YOLO
V5_CA
stopped
at
492.
V5_CA
stopped
at
492.
V5_CA stopped at 492.
11m
m
22m
m
33m
m
44m
m
55m
m
66m
m
77m
m
88m
m
99m
m
Figure8.
8.
Picturesof
ofLED
LEDand
and
LD
optical
beacons
collected
within
1–10
m.
Figure
Figure
8.Pictures
Pictures
of
LED
andLD
LDoptical
opticalbeacons
beaconscollected
collectedwithin
within1–10
1–10m.
m.
An essential
essential
parameter to
to gauge
gauge
the
discrepancy
between the
the predicted
predicted
value
and
An
An
essential parameter
parameter
to
gauge the
the discrepancy
discrepancy between
between
the
predicted value
value and
and
the
true
value
is
the
loss
of
the
validation
set.
Within
500
training
rounds,
all
three
of
the
the
the true
true value
value isis the
the loss
loss of
of the
the validation
validation set.
set. Within
Within 500
500 training
training rounds,
rounds, all
all three
three of
of the
the
models
in
Figure
8
had
reached
convergence.
In
Figure
9a,
the
class
loss
of
the
model
with
models
convergence.
class
modelsin
inFigure
Figure88had
hadreached
reached
convergence.In
InFigure
Figure9a,
9a,the
the
classloss
lossof
ofthe
themodel
modelwith
with
4
−4 , and the model with CBAM
−−44
theCA
CAmodule
moduleisis 4.96
×1010−−4−
4 , the initial model is 5.51 × 10
4.96
×
5.51
×
10
the
,
the
initial
model
is
,
and
the
model
with
CBAM
the CA module
is 4.96 × 10 , the initial model is 5.51× 10 , and the model with CBAM−isis
3,
is 6.13 ×−−4410−4 . In Figure 9b, the object loss of the model with the CA module is 8.65 ×
−3 10
6.13
loss
isis 8.65
6.13××10
10 ..In
8.65××10
10−3,,the
InFigure
Figure9b,
9b,the
theobject
object
lossof
ofthe
themodel
modelwith
withthe
theCA
CAmodule
module
the
−
3
−
3
the initial model is 8.69 ×−−310
, and the model with CBAM is 8.87 × 10
−−33 . According to the
initial
model
is 8.69
and
the
model
with
CBAM
.. According
to
8.69××10
10 3,, the
8.87××10
10with
initial
model
andCA
themodule
modelhas
with
CBAM is
is 8.87
According
to the
the
above data,
theismodel
with
improved
compared
the initial model
above
data,
the
model
with
the
CA
module
has
improved
compared
with
the
initial
model
above
data,
the
model
with
the
CA
module
has
improved
compared
with
the
initial
model
in both target detection and classification, but the model with CBAM has degenerated.
in
inboth
bothtarget
targetdetection
detectionand
and classification,
classification,but
butthe
the model
modelwith
withCBAM
CBAM has
hasdegenerated.
degenerated.
Sensors
2022,
22,
xx FOR
REVIEW
Sensors2022,
2022,22,
22,7940
FOR PEER
PEER REVIEW
Sensors
13
13
9 of
21
9 of
21 21
9 of
-4
-3
-4
10
10
YOLOv5n
v5n
YOLO
+CBAM
+CBAM
+CA
+CA
12
12
10.5
10.5
11
11
Object Loss
Object Loss
Classes Loss
Loss
Classes
YOLO
YOLO
v5nv5n
+CBAN
+CBAN
+CA
+CA
1010
10
10
9.5
9.5
99
88
0.008867
0.008867
99
0.000613
0.000613
77
0.000551
0.000551
66
8.5
8.5
55
00
-3
1010
0.008686
0.008648
0.008648 0.008686
0.000496
0.000496
50
50
100
150
200
250
250
Epoch
Epoch
300
300
350
350
400
400
450
450
500
500
00
5050
100
100 150
150 200
200 250
250 300
300 350350 400400 450450 500500
Epoch
Epoch
(a)
(b)
(b)
Figure
setset
object
loss.
Figure 9.
9. Verification
Verificationset
setloss:
loss:(a)
(a)Verification
Verificationset
setclassification
classificationloss;
loss;(b)
(b)Verification
Verification
object
loss.
Figure
9.
Verification
set
loss:
(a)
Verification
set
classification
loss;
(b)
Verification
set object
loss.
The
ofof
the
classiThe mean
mean of
of average
averageprecision
precision(mAP),
(mAP),
which
crucial
measurement
the
classiThe
mean
of
average
precision
(mAP),which
whichisis
isa aacrucial
crucialmeasurement
measurement
of
the
classification
accuracy
of
the
model,
refers
to
the
average
prediction
accuracy
of
each
type
ofof
fication accuracy
accuracy of
of the
the model,
model, refers
refers to
to the
the average
average prediction
prediction accuracy
accuracy of
of each
each type
type
of
fication
target
entire
dataset.
The
recall
isisa is
indicator
toto
target and
and
then
the
average
value
the
entire
dataset.
The
recall
acrucial
crucial
indicator
target
and then
thenthe
theaverage
averagevalue
valueofof
ofthe
the
entire
dataset.
The
recall
a crucial
indicator
determine
whether
the
object
has
examined.
As
ininFigure
10a,
the
maxdetermine
whether
thetarget
target
object
hasbeen
beenbeen
examined.
Asshown
shown
Figure
10a,
the
maxto
determine
whether
the
target
object
has
examined.
As shown
in Figure
10a,
the
imum
mAP
of
the
network
with
the
CA
module
is
96.1%,
the
maximum
mAP
of
the
netimum mAPmAP
of the
withwith
the CA
is 96.1%,
the maximum
mAPmAP
of the
maximum
of network
the network
the module
CA module
is 96.1%,
the maximum
ofnetthe
work
CBAM
isis 93.9%,
and
the
maximum
mAP
ofofthe
original
YOLO
V5n
model
is is
network
with
CBAM
is 93.9%,
maximum
mAP
of
original
YOLO
V5n
model
work with
with
CBAM
93.9%,
andand
thethe
maximum
mAP
thethe
original
YOLO
V5n
model
94.6%.
It
be
seen
that
the
CA
module
improves
the
classification
accuracy
byby
1.5%,
is
94.6%.
It
can
seen
that
the
CA
module
improves
the
classification
accuracy
by
1.5%,
94.6%.
It can
can
bebe
seen
that
the
CA
module
improves
the
classification
accuracy
1.5%,
while
the
CBAM
reduces
the
classification
accuracy
by
0.7%.
The
reason
why
the
CBAM
while
while the
the CBAM
CBAM reduces
reduces the
the classification
classification accuracy
accuracy by
by 0.7%.
0.7%. The
The reason
reason why
why the
the CBAM
CBAM
module
object
and
the
module
causes
network
degradation
that
the
long-distance
optical
signal
object
and
the
module causes
causesnetwork
networkdegradation
degradationisis
isthat
thatthe
thelong-distance
long-distanceoptical
opticalsignal
signal
object
and
the
mirror
target
in
the
training
image
have
high
consistency,
the
target
scale
is
small,
and
the
mirror
target
in
the
training
image
have
high
consistency,
the
target
scale
is
small,
and
the
mirror target in the training image have high consistency, the target scale is small, and the
global
features
cannot
effectively
separate
the
noise
and
the
target.
global
the
noise
and
the
target.
AsAs
shown
in in
Figure
10b,
global features
featurescannot
cannoteffectively
effectivelyseparate
separate
the
noise
and
the
target.
Asshown
shown
inFigure
Figure
10b,
the
recall
rate
of
the
target
by
the
CA
module
is
also
improved
by
0.8%.
In
the
detecthe
rate of
theoftarget
by thebyCA
is alsoisimproved
by 0.8%.
In theIn
detection
of
10b,recall
the recall
rate
the target
themodule
CA module
also improved
by 0.8%.
the detection
of distant
the network
withhas
CAa has
a good effect
on recognition
classifidistant
targets,targets,
the network
with CA
good
recognition
and and
classification.
tion of distant
targets,
the network
with CA
has aeffect
good on
effect
on recognition
and classification. Therefore,
in the subsequent
experiments
to verify
the model
detection
the
Therefore,
in the subsequent
experiments
to verify
model
effect,effect,
the
YOLO
cation. Therefore,
in the subsequent
experiments
tothe
verify
the detection
model detection
effect,
the
YOLO
V5_CA
and
YOLO
V5n
models
are
used
for
comparison.
V5_CA
and
YOLO
V5n
models
are
used
for
comparison.
YOLO V5_CA and YOLO V5n models are used for comparison.
1
1
YOLO v5n
YOLO v5n
+CBAM
+CBAM
+CA
+CA
0.9
0.9
0.8
0.8
0.961
0.961
0.7
0.7
0.946
0.946
0.939
0.939
0.1
0
0
YOLO V5n
YOLO V5n
+CBAM
+CBAM
+CA
+CA
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
0
0.951 0.940
0.951 0.9400.943
0.943
Recall
Recall
mAP
mAP
0.5 0.5
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
1
0
100
0
100
200
300
200 Epoch 300
Epoch
(a)
(a)
400
400
500
500
0
100
100
200
300
Epoch 300
200
Epoch
(b)
(b)
400
400
500
500
Figure
Figure 10.
10. Comparison
Comparisonof
ofnetwork
networktraining
trainingresults:
results:(a)
(a)Mean
MeanofofAverage
AveragePrecision;
Precision;(b)
(b)Recall.
Recall.
Figure 10. Comparison of network training results: (a) Mean of Average Precision; (b) Recall.
Sensors
2022,
22,
xx FOR
Sensors 2022,
2022, 22,
22, 7940
FOR PEER
PEER REVIEW
REVIEW
Sensors
10
1010of
of
21
of 21
21
Figure
aa comparison
comparison
detection
YOLO
and
YOLO
Figure
11 is
is
chart
of
the
results
of
V5n
Figure 11
11
is a
comparison chart
chart of
of the
the detection
detection results
results of
of YOLO
YOLO V5n
V5n and
and YOLO
YOLO
V5_CA.
It
can
be
seen
that
when
the
characteristics
of
the
target
object
are
not
obvious,
V5_CA.
It can
can be
beseen
seenthat
thatwhen
whenthe
thecharacteristics
characteristics
target
object
obviV5_CA. It
of of
thethe
target
object
are are
not not
obvious,
there
will
aa problem
of
detection
in
of
model
ous,
will
be a problem
of missed
detection
in the detection
of the original
model
therethere
will be
be
problem
of missed
missed
detection
in the
the detection
detection
of the
the original
original
model (Figure
(Figure
11a).
and
characteristics
appear
at
time
(Figure
11a).objects
When objects
and with
noisesvery
withsimilar
very similar
characteristics
the same
11a). When
When
objects
and noises
noises
with
very
similar
characteristics
appearappear
at the
theatsame
same
time
(Figure
11b),
the
original
model
will
have
the
problem
of
misidentifying
the
noise
time
(Figure
11b),
the
original
model
will
have
the
problem
of
misidentifying
the
noise
as aaa
(Figure 11b), the original model will have the problem of misidentifying the noise as
as
target.
When
detecting
close-range
LED
targets
(Figure
11c,d),
the
original
model
will
target.
When
detecting
close-range
LED
targets
(Figure
11c,d),
the
original
model
will
fail
target. When detecting close-range LED targets (Figure 11c,d), the original model will fail
fail
to
detect
them.
The
model
with
the
CA
module
has
a
good
classification
effect
on
the
to
thethe
CACA
module
has has
a good
classification
effecteffect
on theon
target
to detect
detectthem.
them.The
Themodel
modelwith
with
module
a good
classification
the
target
and
and
the
accuracy
is
than
the
and
noise,
and
the
detection
accuracy
is higher
than the
original
model.model.
target
and noise,
noise,
and
the detection
detection
accuracy
is higher
higher
than
the original
original
model.
(a)
(a)
(b)
(b)
(c)
(c)
(d)
(d)
(e)
(e)
(f)
(f)
(g)
(g)
(h)
(h)
Figure
11.
detection
result:
YOLO
Figure 11.
11. Target
Target
detection result:
result: (a–d)
(a–d)
are
YOLO V5n
V5n detection
detection results;
results;
(e–h)
are
YOLO V5_CA
V5_CA
Figure
Target detection
(a–d) are
are YOLO
YOLO
V5n
detection
results; (e–h)
(e–h) are
are YOLO
V5_CA
detection
results.
detection results.
results.
detection
We
our
network
[31].
The
region
that
the
We
used Grad-CAM
Grad-CAMas
asaaavisualization
visualizationtool
toolfor
for
our
network
[31].
region
We used
used
Grad-CAM
as
visualization
tool
for
our
network
[31].
TheThe
region
thatthat
the
network
is
most
interested
in
is
the
one
with
the
redder
in
Figure
12.
It
is
clear
that
YOLO
the
network
is most
interested
thewith
one the
withredder
the redder
in Figure
It isthat
clear
that
network
is most
interested
in is in
theisone
in Figure
12. It is12.
clear
YOLO
V5
the
CA
module
outperforms
the
model
in
extraction
and
YOLO
with
CA module
outperforms
the original
in beacon
extraction
and
V5 with
withV5
the
CAthe
module
outperforms
the original
original
model model
in beacon
beacon
extraction
and localilocalization
noise
and
As
result,
in
target
recognition
and
exlocalization
for both
and light.
a result,
inunderwater
the underwater
target
recognition
zation for
for both
both
noisenoise
and light.
light.
As aa As
result,
in the
the
underwater
target
recognition
andand
extraction
stage,
the
recognizer
is
YOLO
v5
with
the
CA
module.
extraction
stage,
the
recognizer
is
YOLO
v5
with
the
CA
module.
traction stage, the recognizer is YOLO v5 with the CA module.
(a)
(a)
(b)
(b)
(b)
(b)
(c)
(c)
(d)
(d)
(d)
(d)
(e)
(e)
(f)
(f)
(g)
(g)
(h)
(h)
(h)
(h)
Figure
maps: (a–d)
are heat
maps of
YOLO V5n;
heat
maps
of
YOLO
V5_CA.
Figure 12.
12. Heat
Heat
V5n; (e–h)
(e–h) are
are heat
heat maps
maps of
of YOLO
YOLO V5_CA.
V5_CA.
Figure
12.
Heat maps:
maps: (a–d)
(a–d) are
are heat
heat maps
maps of
of YOLO
YOLO V5n;
(e–h)
are
3.3. SRGAN and Zernike Moments-Based Sub-Pixel Optical Center Positioning Method
After correctly identifying and extracting the target, the size of the light source feature
points is only a few to a dozen pixels because the scale of the target at about 10 m is
extremely small. Conventional image processing techniques, such as filtering and picture
open and close operations, will overwhelm the target light source, making accurate location difficult. Therefore, we introduce super-resolution generative adversarial networks
(SRGAN) into the detection process and perform 4× upscaling on the identified beacon
image. The feature information of underwater small targets can be effectively improved by
using this technique, and it also offers a guarantee for the accuracy of subsequent subpixel
centroid positioning based on Zernike moments.
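The following minimal sketch illustrates this stage of the pipeline: a detected beacon crop is upscaled 4×, thresholded with the OTSU method, and the centroid of each light-source blob is extracted with OpenCV. It is a simplified illustration only: upscale_4x stands in for the trained SRGAN generator (plain bicubic resizing is used as a placeholder), and the Zernike-moment sub-pixel refinement described below is not reproduced, so the returned centroids are pixel-level rather than sub-pixel.

```python
import cv2
import numpy as np

def upscale_4x(crop_bgr):
    # Placeholder for the trained SRGAN generator: plain bicubic 4x upscaling.
    h, w = crop_bgr.shape[:2]
    return cv2.resize(crop_bgr, (w * 4, h * 4), interpolation=cv2.INTER_CUBIC)

def light_centroids(beacon_crop_bgr):
    """Return (x, y) centroids of the light-source blobs in a detected beacon crop."""
    sr = upscale_4x(beacon_crop_bgr)
    gray = cv2.cvtColor(sr, cv2.COLOR_BGR2GRAY)
    # OTSU threshold separates the bright light sources from the background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Erode/dilate to smooth the light-source edges before locating blobs.
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(cv2.erode(binary, kernel), kernel)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            # Centroid in super-resolved coordinates; divide by 4 for scale recovery.
            centroids.append((m["m10"] / m["m00"] / 4.0, m["m01"] / m["m00"] / 4.0))
    return centroids
```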
The core of SRGAN consists of two networks: a super-resolution generator and a
discriminator, where the discriminator uses the VGG19. First, apply a Gaussian filter to a
real high-resolution image Î HR with channel and size C × W × H to obtain a low-resolution
image I LR = C × rW × rH, where the scaling factor is r.This low-resolution image is then
used as input to the generator and trained to produce a high-resolution image I HR . The
original high-resolution image Î HR and the generated image I HR are both input to the
discriminator to obtain the perceptual loss function l SR of the generated image and the real
image, which includes the content loss $l^{SR}_{VGG/i,j}$ and the adversarial loss $l^{SR}_{Gen}$ [23]. Then the
relationship between the losses can be obtained as
$$
l^{SR} = l^{SR}_{VGG/i,j} + 10^{-3}\, l^{SR}_{Gen}, \tag{4}
$$
where
$$
l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}\big(\hat{I}^{HR}\big)_{x,y} - \phi_{i,j}\big(G_{\theta_G}\big(I^{LR}\big)\big)_{x,y} \right)^{2},
$$
$$
l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}\big(I^{LR}\big)\big).
$$
In Formula (4), GθG is the reconstructed high-resolution image, DθD is the probability
that an image belongs to a real high-resolution image, i and j represent the ith maximum
pooling layer and the jth convolution layer in the VGG network, respectively. We used the
high-definition camera Canon D60 to collect 7171 high-definition images of underwater
optical beacons and trained 400 rounds under the condition of a scaling factor of 4. At the
same time, in order to verify the advantages of SRGAN, we used the results generated by
the SRResNet for comparison.
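For illustration, the following sketch shows one way the perceptual loss of Formula (4) can be evaluated in PyTorch, using VGG19 features for the content term and the discriminator output for the adversarial term. It is a hedged sketch rather than the training code used in this work: generator and discriminator are assumed to be the SRGAN networks, and the VGG layer index is an assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG19 feature extractor phi_{i,j} for the content loss (layer index assumed).
vgg_features = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def perceptual_loss(generator, discriminator, lr_img, hr_img):
    """Formula (4): l_SR = content (VGG) loss + 1e-3 * adversarial loss."""
    sr_img = generator(lr_img)                       # G_theta_G(I_LR)
    # Content loss: MSE between VGG feature maps of the real HR and generated images.
    content = F.mse_loss(vgg_features(sr_img), vgg_features(hr_img))
    # Adversarial loss: -log D(G(I_LR)), summed over the batch.
    adversarial = (-torch.log(discriminator(sr_img) + 1e-8)).sum()
    return content + 1e-3 * adversarial
```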
Figure 13 shows the training result of SRGAN, where the generator loss (Figure 13a)
drops to 5.07 × 10−3 in 400 epochs, indicating that the model converges well. Figure 13b
is the peak signal-to-noise ratio (PSNR), which is mainly aimed at the error between the
corresponding pixels of the generated image and the original image. The PSNR of the
generated image can reach up to 32.42 dB. Structural similarity index measurement (SSIM)
is an index used to measure the similarity of brightness, contrast, and structure between the
generated image and the original high-resolution image. It can be seen that the maximum
SSIM of the generated image can reach 91.98% (Figure 13c). The aforementioned data
demonstrates that this method produces low levels of image distortion, and the image has
good structural integrity and structural details.
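The PSNR and SSIM values reported above can be reproduced with standard image-quality routines; the snippet below is a minimal example using scikit-image, assuming generated and original are aligned 8-bit RGB images of equal size.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
import numpy as np

def sr_quality(generated: np.ndarray, original: np.ndarray):
    # PSNR: pixel-wise error between the generated and original images (dB).
    psnr = peak_signal_noise_ratio(original, generated, data_range=255)
    # SSIM: similarity of brightness, contrast, and structure.
    ssim = structural_similarity(original, generated, channel_axis=-1, data_range=255)
    return psnr, ssim
```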
12
1212
ofofof
21
2121
33
0.016
0.016
32
0.014
0.014
31
0.006
0.006
0.004
0.004
0
0
28
50
100
100
150
150
200
200
250
Epoch
Epoch
250
300
300
350
350
400
400
25
0.86
0.84
27
27
0.82
0.86
0.84
0.82
26
26
50
SSIM
29
0.9198
0.9198
0.88
0.88
PSNR(dB)
PSNR(dB)
0.00507
0.00507
32.42
32.42
30
28
0.008
0.008
0.9
0.9
31
29
0.01
0.92
0.92
32
30
0.012
0.012
0.01
33
SSIM
0.018
0.018
G Loss
G Loss
Sensors
2022,
x FOR
PEER
REVIEW
Sensors
2022,
22,
x7940
FOR
PEER
REVIEW
Sensors
2022,
22,22,
25
0
0.8
0
50
(a)(a)
50
100
100
150
150
200
200
250
Epoch
Epoch
(b)(b)
250
300
300
350
350
400
400
0.8
0
0
50
50
100
100
150
150
200
200
250
Epoch
Epoch
250
300
300
350
350
400
400
(c)(c)
Figure
SRGAN
training
results:
Generator
loss;
Peak
signal-to-noise
ratio;
Structural
Figure
13.
SRGAN
training
results:
(a)
Generator
loss;
(b)
Peak
signal-to-noise
ratio;
(c)
Structural
Figure
13.13.
SRGAN
training
results:
(a)(a)
Generator
loss;
(b)(b)
Peak
signal-to-noise
ratio;
(c)(c)
Structural
similarity
index
measurement.
similarity
index
measurement.
similarity index measurement.
Figure
displays
the
low-resolution
target
images,
the
target
images
produced
Figure
14
displays
the
low-resolution
target
images,
the
target
images
produced
by
Figure
1414
displays
the
low-resolution
target
images,
the
target
images
produced
byby
SRGAN,
and
the
target
images
produced
by
SRResNet
in
order
to
illustrate
the
benefits
SRGAN,
SRGAN, and
and the
the target
target images
images produced
produced by
by SRResNet
SRResNet in
in order
order to
to illustrate
illustrate the
the benefits
benefits
SRGAN.
can
be
clearly
seen
that
the
image
generated
SRResNet
will
produce
blurofof
SRGAN.
can
clearly
seen
that
the
image
generated
by
SRResNet
will
produce
of
SRGAN.
ItItIt
can
bebe
clearly
seen
that
the
image
generated
byby
SRResNet
will
produce
blurringproblems
problems
the
edge
the
target
light
source
(Figure
14c);
noise
points
willalso
also
blurring
problems
atedge
the
edge
oftarget
the
target
light
source
(Figure
14c);
noise
points
will
ring
atatthe
ofofthe
light
source
(Figure
14c);
noise
points
will
appear
near
the
target
feature
points
(Figure
14f);
the
background
will
produce
grid
noise
also
appear
near
the
target
feature
points
(Figure
14f);
the background
will
produce
grid
appear
near
the
target
feature
points
(Figure
14f);
the
background
will
produce
grid
noise
(Figure
14i).
These
noises
all
lead
errors
in
the
process
of
light
centroid
localizanoise
(Figure
14i).
These
noises
all
lead
toinerrors
in the of
process
ofsource
light
source
centroid
(Figure
14i).
These
noises
all
lead
toto
errors
the
process
light
source
centroid
localization,
while
the
images
generated
by
SRGAN
have
good
structural
integrity
and
less
noise
localization,
while
the
images
generated
by
SRGAN
have
good
structural
integrity
and
less
tion, while the images generated by SRGAN have good structural integrity and less noise
(Figure
14b,e,h).
noise (Figure
14b,e,h).
(Figure
14b,e,h).
(a)(a)
(b)(b)
(c)(c)
(d)(d)
(e)(e)
(f)(f)
(g)(g)
(h)(h)
(i)(i)
Original
images
Original
images
SRGAN
SRGAN
SRResNet
SRResNet
Detail
contrasts
Detail
contrasts
Figure14.
Comparisonof
theoriginal
originalimages
imagesand
andthe
thesuper-resolution
super-resolutionreconstruction
reconstructionimages:
images:
Figure
14.14.Comparison
Comparison
ofofthe
the
original
images
and
the
super-resolution
reconstruction
images:
Figure
(a,d,g)
are
the
original
target
images;
(b,e,h)
are
super-resolution
images
of
SRGAN;
(c,f,i)
are
super(a,d,g)
(a,d,g) are
are the
the original
originaltarget
targetimages;
images;(b,e,h)
(b,e,h)are
aresuper-resolution
super-resolutionimages
imagesof
ofSRGAN;
SRGAN;(c,f,i)
(c,f,i)are
aresupersuperresolution
images
of
SRResNet
(the
yellow
box
represents
the
SRGAN
image
detail,
and
the
red
box
resolution
resolution images
images of
of SRResNet
SRResNet (the
(the yellow
yellow box
box represents
represents the
the SRGAN
SRGAN image
image detail,
detail, and
and the
the red
red box
box
represents
the
SRResNet
image
detail).
represents
represents the
the SRResNet
SRResNet image
image detail).
detail).
Accuratelylocating
locatingthe
thecenter
centerof
eachlight
lightsource
sourcein
theimage
imageis
key
step.The
The
Accurately
key
step.
Accurately
locating
the
center
ofofeach
each
light
source
ininthe
the
image
is isaa akey
step.
The
process
of
the
centroid
extraction
stage
is
shown
in
Figure
15:
Firstly,
the
OTSU
method
process
process of
of the
the centroid
centroid extraction
extraction stage
stage is
is shown
shown in
in Figure
Figure 15:
15: Firstly,
Firstly, the
the OTSU
OTSU method
method
selected
for
threshold
segmentation,
and
then
the
image
corroded
and
expanded
isis
selected
for
threshold
segmentation,
and
then
the
image
isis
corroded
and
expanded
toto
is
selected
for
threshold
segmentation,
and
then
the
image
is
corroded
and
expanded
to
Sensors 2022, 22, x FOR PEER REVIEW
Sensors 2022, 22, 7940
13 of 21
13 of 21
smooth the
the light
light source
source edge
edge to
to obtain
obtain the
the regular
regularlight
lightsource
sourceedge.
edge. The
The sub-pixel
sub-pixel edge
edge
smooth
refinement
method
based
on
the
Zernike
moment
is
used
to
obtain
the
edge
of
the
subrefinement method based on the Zernike moment is used to obtain the edge of the sub-pixel
pixel
light
source.
Finally,
the
centroid
coordinates
of
each
light
source
are
extracted
by
light source. Finally, the centroid coordinates of each light source are extracted by the
the
centroid
formula.
centroid formula.
Start
SR target image input
SR gray image
Calculate the
threshold by OTSU
method
Image erode and inflation
Calculate the
subpixel edge and
locate the centers
End
Figure
Figure15.
15. Sub-pixel
Sub-pixel optical
optical center
centerpositioning
positioningflow
flowchart.
chart.
The
The difference
difference between
between the
the background
background of
of the
the feature
feature image
image and
and the
the target
target light
light
source
super-resolution
enhancement
hashas
been
obvious,
sourceafter
afterYOLO
YOLOV5_CA
V5_CAextraction
extractionand
and
super-resolution
enhancement
been
obviso
thesoOTSU
method
has has
been
selected
as as
thethe
threshold
segmentation
ous,
the OTSU
method
been
selected
threshold
segmentationalgorithm.
algorithm.The
The
OTSU
method
can
effectively
separate
the
target
light
source
part
and
the
background
part
OTSU method can effectively separate the target light source part and the background
from
the foreground.
It defines
the segmentation
threshold
as a solution
to maximize
the
part from
the foreground.
It defines
the segmentation
threshold
as a solution
to maximize
inter-class
variance.
The
scheme
is
established
as
follows:
the inter-class variance. The scheme is established as follows:
2 2
σ2 =
g2 − ge )22,
σ 2ω= 1 ·(
ω1g1( −
g1 −geg) e )++ωω2 ·(
2 ( g 2 − g e ) ,
(5)
(5)
where the between-class variance is denoted as σ²; ω1 and ω2 are the probabilities that a pixel belongs to the target area or the background area, respectively; g1 and g2 are the average gray values of the target and background pixels, respectively; and ge is the average gray value of all pixels in the image. Assuming that the image is segmented into the target area and the background area when the gray segmentation threshold is k, the first-order cumulative moment gk of k is substituted into Formula (5):
$$\sigma^2 = \frac{(g_e \,\omega_1 - g_k)^2}{\omega_1 (1 - \omega_1)}. \qquad (6)$$
By traversing the 0–255 gray values in Formula (6), the required threshold is the gray value whose corresponding σ² is the largest, i.e., the value that maximizes the between-class variance. There is a great difference between the light source characteristics and the background characteristics of an underwater optical beacon, so the OTSU method is stable.
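As a concrete illustration of this search, the sketch below evaluates Formula (6) at every gray level and returns the maximizing threshold. It is a minimal NumPy rendering of the criterion, assuming an 8-bit gray image, and is not the implementation used in the experiments.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level k maximizing the between-class variance of
    Formula (6): sigma^2 = (g_e * w1 - g_k)^2 / (w1 * (1 - w1))."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()                      # gray-level probabilities
    w1 = np.cumsum(p)                          # probability of the class at or below k
    g_k = np.cumsum(p * np.arange(256))        # first-order cumulative moment
    g_e = g_k[-1]                              # mean gray value of the whole image
    sigma2 = np.zeros(256)
    valid = (w1 > 0) & (w1 < 1)                # avoid dividing by zero
    sigma2[valid] = (g_e * w1[valid] - g_k[valid]) ** 2 / (w1[valid] * (1.0 - w1[valid]))
    return int(np.argmax(sigma2))

# Example: two well-separated gray populations.
img = np.concatenate([np.full(1000, 30), np.full(200, 220)]).astype(np.uint8)
print(otsu_threshold(img.reshape(40, 30)))     # 30 (any level from 30 to 219 separates the modes)
```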
To precisely determine the pixel coordinates of each light center, it is necessary to obtain the edges of each light source after obtaining the binary image from threshold segmentation. Therefore, a sub-pixel edge detection method based on the Zernike moment is used. This method is not affected by image rotation and has good noise endurance. The n-order and m-order Zernike moments of the image f(x, y) are defined as follows:
$$Z_{nm} = \frac{n+1}{\pi} \iint_{x^2 + y^2 \le 1} f(x, y)\, V_{nm}^{*}(\rho, \theta)\, \mathrm{d}x\, \mathrm{d}y, \qquad (7)$$
where V*nm(ρ, θ) is the conjugate of the orthogonal n-order and m-order Zernike polynomial Vnm(ρ, θ) on the unit circle of the polar coordinate system. Assuming the ideal edge is rotated by θ, the rotational invariance of Zernike moments gives Z′00 = Z00, Z′11 = Z11·e^{iθ}, and Z′20 = Z20, so the sub-pixel edge of the image can be represented by Formula (8):
$$\begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} + \frac{N d}{2} \begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix}. \qquad (8)$$
In Formula (8), d is the distance from the center of the unit circle to the ideal edge in the polar coordinate system, and N is the template size of the Zernike moments, which improves the accuracy as it increases but also increases the calculation time.
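To make the use of these moments concrete, the sketch below estimates the sub-pixel edge point of a small window by numerically integrating the correlations behind Formula (7) and applying Formula (8). The ideal step-edge assumption, the brute-force sub-sampled integration (instead of precomputed moment templates), and the 7 × 7 example are illustrative choices, not the paper's implementation.

```python
import numpy as np

def zernike_subpixel_edge(window: np.ndarray, s: int = 8):
    """Estimate the sub-pixel edge point in a square N x N window.

    Illustrative sketch only: assumes an ideal step edge and integrates the
    Zernike correlations numerically (s x s sub-samples per pixel)."""
    N = window.shape[0]
    M = N * s
    c = (2.0 * np.arange(M) + 1.0 - M) / M              # sub-sample centres in (-1, 1)
    x, y = np.meshgrid(c, c)
    inside = (x ** 2 + y ** 2) <= 1.0                   # unit circle of Formula (7)
    f = np.kron(window.astype(np.float64), np.ones((s, s))) * inside
    dA = (2.0 / M) ** 2

    # Correlations with V11 = x + i*y and V20 = 2*(x^2 + y^2) - 1. These are the
    # Zernike moments of Formula (7) without the constant factors (n + 1)/pi;
    # for an ideal step edge the distance to the edge equals a20 / |a11| in this
    # unnormalized form.
    a11 = np.sum(f * (x - 1j * y)) * dA
    a20 = np.sum(f * (2.0 * (x ** 2 + y ** 2) - 1.0)) * dA

    phi = np.arctan2(-a11.imag, a11.real)               # edge normal, towards the brighter side
    d = a20 / abs(a11)                                  # distance from the window centre
    # Formula (8): move N*d/2 pixels from the window centre along the normal.
    xs = (N - 1) / 2.0 + 0.5 * N * d * np.cos(phi)
    ys = (N - 1) / 2.0 + 0.5 * N * d * np.sin(phi)
    return xs, ys

# Example: a vertical step edge between columns 3 and 4 of a 7 x 7 window.
win = np.zeros((7, 7))
win[:, 4:] = 255.0
print(zernike_subpixel_edge(win))  # approximately (3.5, 3.0)
```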
Figure 16 shows the detection comparison of the low-resolution target images and the super-resolution enhanced images using the method described in this paper. It is evident that the direct OTSU and Zernike edge detection algorithms are unable to find all of the light source centers when the target light source structure is incomplete. When the light source is far away, the light source center cannot be detected due to the small number of pixels occupied by each light source. The super-resolution enhancement combined with the Zernike moment sub-pixel detection method can extract the sub-pixel center coordinates of each light source with an accuracy of 0.001 pixels, which provides a guarantee for subsequent accurate attitude calculation.
Figure 16. Comparison between the traditional algorithm and the subpixel centroid localization method based on SRGAN and Zernike moments (the top-row images are targets, and the bottom-row images are results): (a) Results of OTSU threshold segmentation + Zernike moment sub-pixel center search; (b) Results of our method.
4. Experiments on Algorithm Accuracy and Performance
Under the condition that the feature points of the object in the world coordinate system and their corresponding pixel coordinates in the image coordinate system are known, the problem of solving the relative position between the object and the camera is called the perspective-n-points (PnP) problem. Accurately solving this problem generally requires at least four known corresponding points. This section presents the following experiments:
1. Compare the traditional PnP algorithms, OPnP, LHM decomposition, and SRPnP, in solving the coplanar 4-point small optical beacon translation distance error;
2. Compare the accuracy of the traditional algorithm with the method described in Section 3;
3. Compare the running speed of the algorithm before and after adding the super-resolution enhancement.
In order to compare the average relative error and range accuracy of the OPnP, LHM, and SRPnP algorithms, 9 groups totaling 450 samples were collected, with 50 samples taken every 1 m over the range of 10–2 m. The average relative errors of the three algorithms are shown in Figure 17. The average detection distances and experimental data of the three algorithms are shown in Table 1.
Figure 17. Traditional PnP algorithm distance detection results: (a) Detection results of optical beacons within 7 m; (b) Detection results of optical beacons within 6 m; (c) Detection results of optical beacons within 5 m; (d) The average relative error of translation.
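For readers who want to reproduce a comparable four-point pose solve, a minimal sketch using OpenCV's generic solver is given below. The square 88 mm beacon layout, camera intrinsics, and pixel coordinates are illustrative assumptions, and `cv2.solvePnP` with the iterative flag is a stand-in rather than the OPnP, LHM, or SRPnP implementations evaluated here; the norm of the translation vector is the kind of range value compared against the platform readings in Table 1.

```python
import cv2
import numpy as np

# Hypothetical square 4-light beacon, 88 mm across, in the beacon frame
# (coplanar points, Z = 0); the real beacon geometry may differ.
half = 44.0
object_points = np.array([[-half, -half, 0.0],
                          [ half, -half, 0.0],
                          [ half,  half, 0.0],
                          [-half,  half, 0.0]])

# Sub-pixel light centres from the centroid stage (illustrative values only).
image_points = np.array([[310.412, 240.118],
                         [330.276, 240.905],
                         [330.651, 260.337],
                         [310.027, 259.822]])

# Assumed pinhole intrinsics from a prior calibration; distortion corrected.
K = np.array([[1200.0,    0.0, 320.0],
              [   0.0, 1200.0, 240.0],
              [   0.0,    0.0,   1.0]])
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
if ok:
    print(float(np.linalg.norm(tvec)))   # camera-to-beacon range in millimetres
```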
Table 1. Experimental data and algorithm detection results.

Sample Groups | Average Experiment Results (mm) | Average LHM Results (mm) | Average OPnP Results (mm) | Average SRPnP Results (mm)
1 | 10,344.00 | None | None | None
2 | 8892.00 | None | None | None
3 | 7815.00 | None | None | None
4 | 6968.00 | 7167.80 | 7143.45 | 7143.00
5 | 6122.00 | 6256.45 | 6234.07 | 6234.07
6 | 5015.00 | 5095.59 | 5072.82 | 5075.22
7 | 4082.00 | 4126.44 | 4126.27 | 4124.44
8 | 3012.00 | 3050.97 | 3050.45 | 3049.41
9 | 1987.00 | 2014.62 | 2014.34 | 2014.14
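The relative errors discussed below can be tallied straightforwardly from such a table. The sketch that follows reuses the SRPnP column of groups 4–9 purely as an example; the averages reported in the text are computed over the full per-sample data, so the group-mean figure printed here is only indicative.

```python
import numpy as np

# Ground-truth platform distances and average SRPnP solutions for groups 4-9
# of Table 1 (mm); used here only to show how the relative errors are tallied.
measured = np.array([6968.00, 6122.00, 5015.00, 4082.00, 3012.00, 1987.00])
srpnp    = np.array([7143.00, 6234.07, 5075.22, 4124.44, 3049.41, 2014.14])

rel_err = np.abs(srpnp - measured) / measured * 100.0
print(rel_err.round(2))          # relative error of each solvable group, in percent
print(rel_err.mean().round(2))   # mean over the group averages (indicative only)
```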
The experimental data in Table 1 were measured using a high-precision translation platform. The 50 samples in the PnP algorithm solution data of each group are randomly obtained from videos shot at different distances. By analyzing the data, it can be concluded that when the small optical beacon is far away from the camera (10–7 m), the traditional PnP algorithms cannot be solved because the coordinates of the four feature points in the pixel coordinate system cannot be obtained. In the middle and long distance range (5–7 m), the accuracy of the LHM iterative algorithm is the lowest of the three, with an average relative error of about 2.53%. Overall, the solution accuracy of the SRPnP and OPnP algorithms is not much different, and their average relative errors are 1.51% and 1.53%, respectively. In order to reflect the accuracy and the detection range of our method, it is compared with SRPnP. The results are shown in Figure 18.
Figure 18. Ranging results based on super-resolution image enhancement and a subpixel centroid localization method: (a) Detection results of optical beacons within 10 m; (b) Detection results of optical beacons within 9 m; (c) Detection results of optical beacons within 8 m; (d) The average relative error of translation.
By examining the data in Figure 18, it can be seen that the issue of the feature points being unable to be recognized and located at a great distance (10–7 m) has been resolved, and the feature point extraction range of the remote small optical beacon has been greatly improved. The average relative error of the SRPnP algorithm in solving a 10–7 m target is 1.25%, and the average relative error at short distances (within 7 m) is 0.83%. From the above data, we can see that our method reduces the calculation error by 33.6% and improves the calculation accuracy.
In order to further reflect the efficiency of the algorithm, our algorithm and the traditional algorithm are compared in running time under the same hardware and software conditions. The video capture rate of the camera used in the experiment is 20 FPS, and the algorithms used in this experiment are tested on a personal computer equipped with an i7-8750 CPU, 32 GB of memory, and Windows 10. We used Visual Studio 2017 and Qt 5.9.8 for the pose algorithm implementation and neural network migration, without GPU acceleration. In Figure 19, the computing time of our method is 0.063 s (15.87 FPS) at a long distance, and the computing time of the SRPnP algorithm is 0.060 s (16.67 FPS). The computing time of our method is 0.101 s (10.17 FPS) at a short distance, and the computing time of the SRPnP algorithm is 0.093 s (10.83 FPS). The average detection speed of our algorithm over the whole range is 0.088 s (11.36 FPS). It can be seen that the time consumption of the two methods is not much different. This is because the size of the target image is small, and the combination of the super-resolution and sub-pixel algorithms has a small impact.
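A per-frame timing comparison of this kind can be reproduced with a simple stopwatch loop such as the sketch below; `process_frame` is only a placeholder for the full detection, super-resolution, centroid, and pose pipeline and is not the authors' code.

```python
import time

def process_frame(frame):
    """Placeholder for the full pipeline: detection, super-resolution,
    sub-pixel centroid extraction, and PnP pose solving."""
    time.sleep(0.08)          # stand-in for ~80 ms of real processing

frame_times = []
for frame_id in range(100):   # e.g., 100 captured frames
    t0 = time.perf_counter()
    process_frame(frame_id)
    frame_times.append(time.perf_counter() - t0)

mean_t = sum(frame_times) / len(frame_times)
print(f"average per-frame time: {mean_t:.3f} s ({1.0 / mean_t:.2f} FPS)")
```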
Figure 19. Comparison of detection speeds of each frame image: (a) The operating speed of the algorithm within 5 m; (b) The operating speed of the algorithm in the 10–5 m range.
Figure 20 is a graph of the dynamic positioning results of our algorithm, in which the solid lines are the solution results in the X, Y, and Z axis directions, respectively, and the dotted lines are the readings of the moving platform. It can be seen from the results that our method has high detection accuracy and the advantage of real-time performance.
Figure 20. Dynamic positioning experiment results.
In this paper, LED light beacon arrays of different colors and shapes are designed to verify the accuracy of the algorithm, as shown in Figure 21. Therefore, if a ranging experiment for a system with multiple AUVs is designed, a color recognition function can be added after the target detection to perform ranging detection on the different types of installed optical beacons. It can be seen that the method described in this paper has a good solution effect, and it also discriminates well against the mirror images formed by the water surface and the lens glass.
Figure 21. Underwater optical beacon attitude calculation effect diagram: (a) Optical beacon of the green cross; (b) Optical beacon of the blue cross; (c) Optical beacon of the green trapezoid; (d) Optical beacon of the blue trapezoid.
5. Discussion
Table 2 shows the performance comparison between existing underwater optical beacon detection algorithms and the algorithm described in this paper. It can be seen that the algorithm in this paper has high accuracy and a long detection range for optical beacons much smaller than the conventional size, and its running time does not increase significantly. This shows that this paper provides an efficient and accurate optical beacon finding and positioning method for the end-docking of small AUVs.
Table 2. Performance comparison between existing algorithms and our algorithm.

Method | Optical Beacon Size (mm) | Detection Range (m) | Detection Speed (s) | Average Relative Error
R. L.'s [15] | 100 | 3.6 | 0.015 | 2.00%
S. L.'s [16] | 2014 | 6.5 | 0.120 | 0.14%
R. R.'s [17] | 280 | 8.0 | 0.059 | 5.00%
Z. Y.'s [32] | 600 | 4.5 | 0.050 | 4.44%
Ours | 88 | 10 | 0.088 | 1.04%
6. Conclusions
In this paper, a quick underwater monocular camera positioning technique for compact
4-light beacons is presented. It combines deep learning and conventional image processing
techniques. The second part introduces the experimental equipment and system in detail.
A YOLO v5 target detection model with a coordinated attention mechanism is constructed
and compared with the original model and the model with CBAM. The model has a
classification accuracy of 96.1% for small optical beacons, which is 1.5% higher than the
original network structure, and the recall is also increased by 0.8%. A sub-pixel centroid
localization method combining SRGAN super-resolution image enhancement and Zernike
moments is proposed, which improves the feature localization accuracy of small target light
sources to 0.001 pixels. Finally, experimental verification shows that our method extends
the detection range of small optical beacons to 10 m, controls the average relative error of
distance detection at 1.04%, and has a detection speed of 0.088 s (11.36 FPS).
Our method proposes a feasible monocular vision ranging scheme for small underwater optical beacons, which has the advantages of fast calculation speed and high precision.
The combination of super-resolution enhancement and sub-pixel edge refinement is not
limited to underwater optical beacon finding in AUV docking, and it can also be extended
to other object detection fields, for example, satellite image remote sensing and small target detection tasks in medical images.
However, our method also has certain limitations. For example, the optical beacons and laser probes used in the experiment are only fixed on a high-precision rotary device; limited by the fixing method of the equipment and the moving mode of the rotary device, this setup cannot simulate the problems faced by the dynamic docking of AUVs in real
situations. In this paper, optical beacons of various shapes and colors are designed to face
the problem of visual positioning in the multi-AUV working system. However, limited by
the manufacturing cost of the equipment, the multi-AUV docking experiment has not been
carried out. Therefore, fixing the optical beacon and the laser probe in the full-size AUV
for sea trials, verifying the performance of the algorithm under dynamic conditions, and
designing the visual recognition of the multi-AUV system are the next research directions.
Author Contributions: Conceptualization, B.Z.; Data curation, T.Z. and L.S.; Investigation, T.Z.;
Methodology, B.Z.; Project administration, P.Z.; Resources, T.Z. and L.S.; Software, B.Z.; Supervision,
P.Z. and F.Y.; Writing—original draft, B.Z.; Writing—review & editing, P.Z. and F.Y. All authors have
read and agreed to the published version of the manuscript.
Funding: This research was funded by National Natural Science Foundation of China, grant number
51975116 and Natural Science Foundation of Shanghai, grant number 21ZR1402900.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Hsu, H.Y.; Toda, Y.; Yamashita, K.; Watanabe, K.; Sasano, M.; Okamoto, A.; Inaba, S.; Minami, M. Stereo-vision-based AUV navigation system for resetting the inertial navigation system error. Artif. Life Robot. 2022, 27, 165–178. [CrossRef]
2. Guo, Y.; Bian, C.; Zhang, Y.; Gao, J. An EPnP Based Extended Kalman Filtering Approach for Docking Pose Estimation of AUVs. In Proceedings of the International Conference on Autonomous Unmanned Systems (ICAUS 2021), Changsha, China, 24–26 September 2021; Springer Science and Business Media Deutschland GmbH: Changsha, China, 2022; pp. 2658–2667.
3. Dong, H.; Wu, Z.; Wang, J.; Chen, D.; Tan, M.; Yu, J. Implementation of Autonomous Docking and Charging for a Supporting Robotic Fish. IEEE Trans. Ind. Electron. 2022, 1–9. [CrossRef]
4. Bosch, J.; Gracias, N.; Ridao, P.; Istenic, K.; Ribas, D. Close-Range Tracking of Underwater Vehicles Using Light Beacons. Sensors 2016, 16, 429. [CrossRef] [PubMed]
5. Wynn, R.B.; Huvenne, V.A.I.; Le Bas, T.P.; Murton, B.J.; Connelly, D.P.; Bett, B.J.; Ruhl, H.A.; Morris, K.J.; Peakall, J.; Parsons, D.R.; et al. Autonomous Underwater Vehicles (AUVs): Their past, present and future contributions to the advancement of marine geoscience. Mar. Geol. 2014, 352, 451–468. [CrossRef]
6. Jacobi, M. Autonomous inspection of underwater structures. Robot. Auton. Syst. 2015, 67, 80–86. [CrossRef]
7. Loebis, D.; Sutton, R.; Chudley, J.; Naeem, W. Adaptive tuning of a Kalman filter via fuzzy logic for an intelligent AUV navigation system. Control Eng. Pract. 2004, 12, 1531–1539. [CrossRef]
8. Sans-Muntadas, A.; Brekke, E.F.; Hegrenaes, O.; Pettersen, K.Y. Navigation and Probability Assessment for Successful AUV Docking Using USBL. In Proceedings of the 10th IFAC Conference on Manoeuvring and Control of Marine Craft, Copenhagen, Denmark, 24–26 August 2015; pp. 204–209.
9. Kinsey, J.C.; Whitcomb, L.L. Preliminary field experience with the DVLNAV integrated navigation system for oceanographic submersibles. Control Eng. Pract. 2004, 12, 1541–1549. [CrossRef]
10. Marani, G.; Choi, S.K.; Yuh, J. Underwater autonomous manipulation for intervention missions AUVs. Ocean. Eng. 2009, 36, 15–23. [CrossRef]
11. Nicosevici, T.; Garcia, R.; Carreras, M.; Villanueva, M.; IEEE. A review of sensor fusion techniques for underwater vehicle navigation. In Proceedings of the Oceans '04 MTS/IEEE Techno-Ocean '04 Conference, Kobe, Japan, 9–12 November 2004; pp. 1600–1605.
12. Kondo, H.; Ura, T. Navigation of an AUV for investigation of underwater structures. Control Eng. Pract. 2004, 12, 1551–1559. [CrossRef]
13. Bonin-Font, F.; Massot-Campos, M.; Lluis Negre-Carrasco, P.; Oliver-Codina, G.; Beltran, J.P. Inertial Sensor Self-Calibration in a Visually-Aided Navigation Approach for a Micro-AUV. Sensors 2015, 15, 1825–1860. [CrossRef]
14. Li, Y.; Jiang, Y.; Cao, J.; Wang, B.; Li, Y. AUV docking experiments based on vision positioning using two cameras. Ocean Eng. 2015, 110, 163–173. [CrossRef]
15. Zhong, L.; Li, D.; Lin, M.; Lin, R.; Yang, C. A Fast Binocular Localisation Method for AUV Docking. Sensors 2019, 19, 1735. [CrossRef]
16. Liu, S.; Ozay, M.; Okatani, T.; Xu, H.; Sun, K.; Lin, Y. Detection and Pose Estimation for Short-Range Vision-Based Underwater Docking. IEEE Access 2019, 7, 2720–2749. [CrossRef]
17. Ren, R.; Zhang, L.; Liu, L.; Yuan, Y. Two AUVs Guidance Method for Self-Reconfiguration Mission Based on Monocular Vision. IEEE Sens. J. 2021, 21, 10082–10090. [CrossRef]
18. Venkatesh Alla, D.N.; Bala Naga Jyothi, V.; Venkataraman, H.; Ramadass, G.A. Vision-based Deep Learning algorithm for Underwater Object Detection and Tracking. In Proceedings of the OCEANS 2022-Chennai, Chennai, India, 21–24 February 2022; Institute of Electrical and Electronics Engineers Inc.: Chennai, India, 2022.
19. Sun, K.; Han, Z. Autonomous underwater vehicle docking system for energy and data transmission in cabled ocean observatory networks. Front. Energy Res. 2022, 10, 1232. [CrossRef]
20. Jocher, G. YOLOv5 Release v6.0. Available online: https://github.com/ultralytics/yolov5/tree/v6.0 (accessed on 12 October 2021).
21. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
22. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Electr Network, 19–25 June 2021; pp. 13708–13717.
23. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 4681–4690.
24. Khotanzad, A.; Hong, Y.H. Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 1990, 12, 489–497. [CrossRef]
25. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [CrossRef]
26. Lu, C.-P.; Hager, G.D.; Mjolsness, E. Fast and globally convergent pose estimation from video images. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 610–622. [CrossRef]
27. Zheng, Y.; Kuang, Y.; Sugimoto, S.; Astrom, K.; Okutomi, M. Revisiting the pnp problem: A fast, general and optimal solution. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2344–2351.
28. Wang, P.; Xu, G.; Cheng, Y.; Yu, Q. A simple, robust and fast method for the perspective-n-point problem. Pattern Recognit. Lett. 2018, 108, 31–37. [CrossRef]
29. Baiden, G.; Bissiri, Y.; Masoti, A. Paving the way for a future underwater omni-directional wireless optical communication systems. Ocean Eng. 2009, 36, 633–640. [CrossRef]
30. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
31. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
32. Yan, Z.; Gong, P.; Zhang, W.; Li, Z.; Teng, Y. Autonomous Underwater Vehicle Vision Guided Docking Experiments Based on L-Shaped Light Array. IEEE Access 2019, 7, 72567–72576. [CrossRef]