animals

Article
Using Pruning-Based YOLOv3 Deep Learning Algorithm for
Accurate Detection of Sheep Face
Shuang Song 1, Tonghai Liu 1,*, Hai Wang 2, Bagen Hasi 2, Chuangchuang Yuan 1, Fangyu Gao 1 and Hongxiao Shi 2,*

1 College of Computer and Information Engineering, Tianjin Agricultural University, Tianjin 300384, China; shuang_song0809@163.com (S.S.); yuanchuangchuanga@163.com (C.Y.); gaofyu1@163.com (F.G.)
2 Institute of Grassland Research, Chinese Academy of Agricultural Sciences, Hohhot 010010, China; wanghai@caas.cn (H.W.); hasibagen@caas.cn (B.H.)
* Correspondence: tonghai_1227@126.com (T.L.); shihongxiao@caas.cn (H.S.); Tel.: +86-13920136245 (T.L.); +86-15049181288 (H.S.)
Simple Summary: The identification of individual animals is an important step in the development of precision breeding. It plays a great role in both breeding and genetic management. The continuous development of computer vision and deep learning technologies provides new possibilities for the establishment of accurate breeding models. This helps to achieve high productivity and precise management in precision agriculture. Here, we demonstrate that sheep faces can be recognized based on the YOLOv3 target detection network. A model compression method based on the K-means clustering algorithm, combined with channel pruning and layer pruning, is applied to individual sheep identification. The results show that the proposed non-contact sheep face recognition method can identify sheep quickly and accurately.
Citation: Song, S.; Liu, T.; Wang, H.; Hasi, B.; Yuan, C.; Gao, F.; Shi, H. Using Pruning-Based YOLOv3 Deep Learning Algorithm for Accurate Detection of Sheep Face. Animals 2022, 12, 1465. https://doi.org/10.3390/ani12111465
Academic Editors: Guoming Li, Yijie Xiong, Hao Li and Cesare Castellini
Received: 27 March 2022
Accepted: 3 June 2022
Published: 5 June 2022
Abstract: Accurate identification of sheep is important for achieving precise animal management and
welfare farming in large farms. In this study, a sheep face detection method based on YOLOv3 model
pruning is proposed, abbreviated as YOLOv3-P in the text. The method is used to identify sheep in
pastures, reduce stress and achieve welfare farming. Specifically, in this study, we chose to collect
Sunit sheep face images from a certain pasture in Xilin Gol League Sunit Right Banner, Inner Mongolia,
and trained YOLOv3, YOLOv4, Faster R-CNN, SSD and other classical target recognition algorithms on them, comparing their recognition results. Ultimately, the choice was made to optimize
YOLOv3. The mAP was increased from 95.3% to 96.4% by clustering the anchor frames in YOLOv3
using the sheep face dataset. The mAP of the compressed model was also increased from 96.4% to
97.2%. The model size was also reduced to 1/4 of the size of the original model. In addition, we restructured the original dataset and performed a 10-fold cross-validation experiment, obtaining a mean mAP of 96.84%. The results show that clustering the anchor boxes and compressing the model using this dataset is an effective method for identifying sheep. The method is characterized by low memory requirements, high recognition accuracy and fast recognition speed; it can accurately identify
sheep and has important applications in precision animal management and welfare farming.
Keywords: YOLOv3; sheep face recognition; K-means clustering; model compression
1. Introduction
With the rapid development of farming intensification, scale and intelligence, the management quality and welfare requirements of sheep farming are increasing, and sheep identification is becoming more and more important to prevent diseases and improve sheep growth. How to quickly and efficiently complete the individual identification of sheep has become an urgent problem. With the improvement of living standards, people's demand for mutton and goat milk products is increasing, while
the quality requirements are also getting higher and higher. Therefore, large-scale farming
to improve the productivity and production level of the meat and dairy industry is an
effective way to increase farmers’ income, ensure food safety, improve the ability to prevent
and control epidemics, and achieve the coordinated development of animal husbandry
and the environment. Sheep identification is the basis of intelligent farm management, and
currently common techniques for individual identification, such as hot iron tagging [1],
freeze tagging [2], and ear notching [3], are available. These methods are prone to cause
serious physical damage to animals [4]. Electronic identification, using methods such as Radio Frequency Identification (RFID) tags, is widely used in the field of individual livestock identification [5]. However, this method is costly and prone to problems such
as unreadability or fraud. To solve the above problems, experts have started working on
new approaches by using deep learning methods. There has been an increase in the use
of deep learning algorithms in the animal industry for different types of applications [6].
These applications include biometric identification [7], activity monitoring [8], body
condition scoring [9] and behavior analysis [10]. Recently, utilizing biometric traits instead
of traditional methods for identifying individual animals has gained attention, due to the
development of Convolutional Neural Networks (CNNs) [11–13]. To apply deep learning
techniques to biometric recognition, it is necessary to acquire image information of specific
body parts. Currently, the sheep retina [14], cattle body patterns [15] and pig faces have been used. However, obtaining clear data on the retina or body parts is very
difficult [16]. Therefore, this paper proposes a contactless sheep face recognition method based on an improved YOLOv3. Contactless identification methods rest on biometric features, which are unique and not easily forged or lost, making them easier to use, more reliable and more accurate. Non-contact animal identification methods mainly include nose-print recognition, iris recognition, etc. Compared with contact identification methods (such as ear notching, hot iron branding and freeze branding), they have a wide application range, low requirements on identification distance, and relatively simple operation. Animal image data can be obtained by using
portable photo devices such as cell phones and cameras, and the acquisition cost is low,
saving a lot of human resources. The non-contact identification method does not require human–animal contact, which minimizes harm to animals and facilitates their growth and development. Biometric-based identification methods are mostly used
in the identification of animals. Feature extraction using Gabor filters was attempted by Tharawat et al., who compared support vector machine classifiers with different kernels (Gaussian, polynomial, linear) [17]. The results show that the classifier based on the Gaussian kernel is able to achieve 99.5% accuracy in the problem of
nose-print recognition. However, the nasal pattern acquisition process is more difficult and
the captured images contain much redundant information, so the method is not suitable
for large-scale animal identification [18]. The iris of animals contains spots, filaments,
crowns, stripes and other shape features, the combination of which does not change once
the animal is born and is unique, so it can be one of the essential features for individual
identification. Several studies have investigated the iris features of cattle using a two-dimensional complex wavelet transform feature approach. The iris images of cattle were
captured and experimented with using a contactless handheld device, and the recognition
accuracy reached 98.33% [19]. However, due to the large size of the iris capture device, the
cost is high and the reliability of the recognition may be reduced due to image distortion
caused by the lens. Therefore, it is not possible to extend the use of this method on a large
scale. Simon et al. found that the human eye has a unique vascular pattern and that the
retinal vessels are as unique as fingerprints. The retina is also present in all animals and is
thus considered to be a unique individual biological characteristic. Rusk et al. [20] captured and compared retinal images of cattle and sheep. The results showed
that the recognition rate reached 96.2%. However, if the retina is damaged, the recognition
effect will be greatly reduced, or even impossible.
The face is the most direct information about the external characteristics of an individual animal. Facial recognition is promising since it contains many significant features,
e.g., eyes, nose, and other biological characteristics [21,22]. Due to the variability and
uniqueness of facial features, the faces of animals can serve as identifiers for individual recognition. Facial recognition technology for humans has become increasingly sophisticated, but less research has been done on animals. Some institutions and researchers have nevertheless worked persistently on animal facial recognition, and transferring facial recognition technology to animals has yielded good results, providing a new idea for the study of facial recognition of livestock and poultry. Corkery et al. performed recognition of sheep
faces and thus identified individual sheep. The overall analysis was performed using independent component techniques and then evaluated using a pre-classifier, with recognition
rates of 95.3–96% [23]. However, only a few sheep face images were studied in this method,
and the performance of the algorithm was not evaluated on a large sheep face dataset.
Yang obtained more accurate results by analyzing the faces of sheep and introducing triple
interpolation features in cascade pose regression [24]. However, changes in head posture
or severe occlusion during the experiment can also lead to recognition failure. In 2018,
Hansen et al. [13] used an improved convolutional neural network to extract features from pig face images and identify ten pigs. Because the color and facial features of individual pigs are inconspicuous, and the overall environment is complex with many other interfering factors, the identification rate of individual pigs was low. Ma et al.
proposed a Faster-RCNN neural network model based on the Soft-NMS algorithm. The
method enables real-time detection and localization of sheep under complex breeding
conditions. It improves the accuracy of recognition while ensuring the speed of detection.
This model was able to detect sheep with 95.32% accuracy and mark target locations in
real time, providing an effective database for sheep behavior studies [25]. Hitelman et al. applied Faster R-CNN to localize sheep faces in images, providing the detected
faces as input to seven different classification models. The best performance was obtained
using the ResNet50V2 model with the ArcFace loss function, and the application of transfer learning methods resulted in a final recognition accuracy of 97% [26]. However, in most
studies, the generation of high levels of noise and large amounts of data due to different
light sources and quality of captured images can also pose a challenge to the recognition
process [27]. The sheep to be identified are in an environment with complex backgrounds,
and the position and pose of the sheep face can also affect the recognition effect. The above
method demonstrates that it is feasible to apply animal facial images to individual animal
identification. In addition, there is still much room for improvement in data acquisition,
data processing, recognition algorithms, and the accuracy of recognition.
In order to solve the above problems, this paper collected images of sheep faces in
natural environment and produced a dataset. In addition, the sheep face images were
data-enhanced before training to solve the overfitting problem caused by the small amount
of data. The anchor boxes in YOLOv3 are clustered for the dataset of this study to improve
the accuracy of model identification. Then, the model is compressed to save resources
and make it easy to deploy, so that the final model is only 1/4 of the initial model. The
accuracy is also higher than the initial model. The model is effectively simplified while
ensuring the detection accuracy. A dataset of sheep images was established and used to train and test the model. The results show that the method can greatly improve
the detection and identification efficiency, which is expected to make China’s livestock
industry go further toward achieving intelligent development, help ranchers optimize
sheep management and create economic benefits for the sheep farming industry.
2. Materials and Methods
2.1. Research Objects
In this experiment, we used Sunit sheep, a sheep breed located in Xilin Gol League, Inner Mongolia, as the subject. The face of this breed of sheep is mostly black and white, so it is also called "panda sheep". Due to the timid nature of sheep, people cannot get close enough, making it difficult to photograph sheep face data. Prior to the experiment, the sheep were herded from their living area into a closed environment and left to calm down.

2.2. Experimental Setup
The experiment was completed by the experimenter holding the camera equipment, as shown in Figure 1. The enclosed environment in which the sheep were placed was rectangular, 5 m long and 3 m wide. Each sheep was taken in turn to another enclosed environment and filmed individually when it was calm. The deep learning model was implemented in a Windows environment with code based on the PyTorch framework. Training was performed on a PC with an NVIDIA GeForce GTX 3090 Ti.

Figure 1. The layout of the experimental scene.
2.3. Data Collection
The dataset used in this study was obtained from the Sunit Right Banner Desert Grassland Research Base, Institute of Grassland Research, Chinese Academy of Agricultural Sciences. The collection date was 29 June 2021, within the experimental plots, and sheep face data were collected for each sheep in turn. The dataset collected video data from 20 adult Sunit sheep. The videos were recorded at a frame rate of 30 frames per second, a frame width of 1920 pixels, and a frame height of 1080 pixels. To give the collected dataset a certain complexity, different lighting and different angles were used for the shooting, so each sheep's face was captured from different perspectives. In total, 20 sheep were captured, with 1–3 min of video recorded for each sheep. In this study, the sheep number was the ear tag number of the sheep.

2.4. Dataset Creation and Preprocessing
Follow the steps below to create the sheep face dataset: First, the video of each sheep's face was manually marked with the sheep's ear tag number. Next, Python code was used to convert each video file into an image sequence. Finally, 1958 sheep face images were obtained by manually removing the images with blurred sheep faces and the images without sheep faces for data cleaning. To facilitate image annotation and avoid overfitting of the training results, structural similarity index measurement (SSIM) [28] was used to remove overly similar images; SSIM compares the luminance, contrast and structure of two input images to quantify their similarity. In addition, there is a problem of uneven dataset acquisition due to the different characteristics of individual sheep: some sheep are active by nature, so the collected data are sparse, while others prefer to be quiet, resulting in dense data collection. We can use data enhancement to solve the problem of uneven data distribution between different classes. Flipping, rotating and cropping some of the sheep face images can improve the generalization ability of the model and solve the problem of uneven distribution of data classes. In other words, this can prevent the model from learning too few features, or from learning a large number of individual features that degrade the model performance. Some examples of the enhancement are shown in Figure 2. A total of 2400 sheep face images were collated, 120 images per sheep; of these, 1600 were used to train the sheep face recognition model, 400 were used to perform model validation, and the remaining 400 were used to evaluate the final performance of the model. The labeling tool was used for target bounding-box annotation, generating an XML-formatted annotation file for each image. The file content includes the folder name, image name, image path, image size, the category name ("sheep" only), and the pixel coordinates of the labeled box. Figure 3 shows the labeling results.

Figure 2. Data Augmentation.
Figure 3. Labeling results of dataset.
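The video-to-image and SSIM filtering steps above can be sketched as follows. This is a minimal illustration, assuming OpenCV and scikit-image are available; the paths, the 0.85 similarity threshold, and the augmentation parameters are illustrative, not the authors' exact settings.

```python
import cv2
from pathlib import Path
from skimage.metrics import structural_similarity as ssim

def extract_distinct_frames(video_path, out_dir, sim_threshold=0.85):
    # Split one labeled video into frames, keeping a frame only when it is
    # sufficiently dissimilar (SSIM below threshold) from the last kept frame.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    last_kept, idx, saved = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_kept is None or ssim(last_kept, gray) < sim_threshold:
            name = f"{Path(video_path).stem}_{idx:05d}.jpg"
            cv2.imwrite(str(out_dir / name), frame)
            last_kept, saved = gray, saved + 1
        idx += 1
    cap.release()
    return saved

def augment(img):
    # Flip / rotate / crop augmentations used to balance the classes.
    h, w = img.shape[:2]
    flipped = cv2.flip(img, 1)                                  # horizontal flip
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
    rotated = cv2.warpAffine(img, m, (w, h))                    # small rotation
    cropped = img[h // 10: 9 * h // 10, w // 10: 9 * w // 10]   # central crop
    return [flipped, rotated, cropped]

# e.g., one video per ear-tag number:
# extract_distinct_frames("sheep_0003.mp4", "frames/0003")
```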
3. Sheep Face Recognition Based on YOLOv3-P
The aim of our research was to recognize sheep faces through intelligent methods. Inspired by face recognition, we tried to apply deep learning methods to sheep face recognition. In this section, we introduce the proposed K-means clustering and model compression methods.

3.1. Overview of the Network Framework
With the rapid development of deep learning, target detection networks have shown
excellent performance in part recognition. Inspired by deep learning-based face recognition
methods, we improve the YOLOv3 target detection network to achieve facial recognition of
sheep. In this work, we propose the YOLOv3-P network to optimize the network in terms
of recognition accuracy and model size, respectively.
3.2. Sheep Face Detection Based on YOLOv3
The YOLOv3 [29] network model is one of the object detection methods that evolved
from YOLO [30] and YOLOv2 [31]. The YOLOv3 network is built on Darknet53, which consists of 53 convolutional layers and uses skip connections inspired by ResNet. It has shown state-of-the-art accuracy with fewer floating-point operations and better speed. YOLOv3 uses residual skip connections to achieve accurate recognition of small targets through upsampling and cascading, similar to the way a feature pyramid network performs detection at three different scales. In this way, YOLOv3 can
detect targets of any size. After feeding an image into the YOLOv3 network, the detected
information, i.e., bbox coordinates and objectness scores, as well as category scores, are
output from the three detection layers. The predictions of the three detection layers are
combined and processed using a non-maximum suppression method, and then the final
detection results are determined. Unlike the Faster-R-CNN [32], YOLOv3 is a single-stage
detector that formulates the detection problem as a regression problem. YOLOv3 can detect
multiple objects by reasoning at once, so the detection speed is very fast. In addition, by
applying a multi-scale detection method, it is able to compensate for the low accuracy of
YOLO and YOLOv2. The YOLOv3 framework is shown in Figure 4. The input image is
divided into n × n grids. If the center of a target falls within a grid, that grid is responsible
for detecting that target. The default anchor boxes of YOLOv3 are obtained by clustering the COCO dataset. The COCO dataset contains 80 categories of targets, ranging in size from large buses and bicycles to small cats and birds. If the parameters
of the anchor frame are directly used for the training of the sheep face image dataset, the
detection results will be poor. Therefore, K-means algorithm is used to recalculate the
size of the anchor frame based on the sheep face image dataset. A comparison of the
clustering results is shown in Figure 5. The red pentagrams in the figure are the anchor
frame scales after clustering, while the green pentagrams represent the anchor frame scales
before clustering. The anchor boxes of the collected sheep face dataset after clustering were set to (41, 74), (56, 104), (71, 119), (79, 146), (99, 172), (107, 63), (119, 220), (156, 280), (206, 120).
Each feature map is assigned 3 different sizes of anchor boxes for bounding box prediction.
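A minimal sketch of this clustering step, using 1 − IoU as the K-means distance (the metric named in the Conclusions); the initialization and update details here are assumptions, not the authors' exact implementation. `boxes` is an (N, 2) array of (width, height) pairs read from the dataset annotations.

```python
import numpy as np

def iou_wh(boxes, anchors):
    # IoU of (width, height) pairs with boxes and anchors aligned at the origin.
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU (lowest 1 - IoU).
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```

Replacing the default COCO anchors with the nine cluster centres listed above is what lifted the mAP from 95.3% to 96.4% on this dataset.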
Figure 4. YOLOv3 structure. CBL is the smallest component of the YOLOv3 network architecture, which consists of Conv (convolution) + BN + Leaky ReLU; Res unit is the residual component; ResX, where X stands for a number (Res1, Res2, . . ., Res8, etc.), consists of a CBL and N residual components.
Figure 5. Clustering results of anchor boxes. The red pentagrams in the figure are the anchor frame scales after clustering, while the green pentagrams represent the anchor frame scales before clustering.

3.3. Compress the Model by Pruning
YOLOv3 has a greater advantage in sheep face recognition accuracy compared with other target detection networks, but it is still difficult to meet the requirements of most embedded devices in terms of model size. Therefore, model compression is needed to reduce the model parameters and make the model easy to deploy. The methods of model compression are divided into pruning [33,34], quantization [35,36], low-rank decomposition [37,38] and knowledge distillation [39,40]. Pruning has become a hot research topic in academia and industry because of its ease of implementation and excellent results. Fan et al. simplified and sped up detection by using channel pruning and layer pruning methods on the YOLOv4 network model. This reduced the trimmed YOLOv4 model size and inference time by 241.24 MB and 10.82 ms, respectively; the mean average precision (mAP) also improved from 91.82% to 93.74% [41]. A pruning algorithm combining channel and layer pruning for model compression was proposed by Lv et al. The results show a 97% reduction in the size of the original model, while the speed of reasoning increased by a factor of 1.7; at the same time, there is almost no loss of precision [42]. Shao et al. also proposed a method combining network pruning and YOLOv3 for aerial IR pedestrian detection; Smooth-L1 regularization is introduced on the channel scale factor, which speeds up reasoning while ensuring detection accuracy. Compared with the original YOLOv3, the model volume is compressed by 228.7 MB, and the model AP is reduced by only 1.7% [43]. The above methods can effectively simplify the model while ensuring image quality and detection accuracy, providing a reference for the development of portable mobile terminals. In this article, inspired by network slimming, which has proven to be an effective pruning method, we choose a pruning algorithm that combines channel and layer pruning to compress the network model.

3.3.1. Channel Pruning
Channel pruning is a coarse-grained but effective method. More importantly, the pruned model can be easily implemented without the need for dedicated hardware or software. Channel pruning for deep networks is essentially based on the correspondence between feature channels and convolutional kernels: cropping off a feature channel and its corresponding convolution kernel achieves model compression. Suppose the $i$-th convolutional layer of the network is $x_i$, and the dimension of the input feature map of this convolutional layer is $(h_i, w_i, n_i)$, where $n_i$ is the number of input channels, and $h_i$, $w_i$ denote the height and width of the input feature map, respectively. The convolutional layer transforms the input feature map $x_i \in \mathbb{R}^{n_i \times h_i \times w_i}$ into the output feature map $x_{i+1} \in \mathbb{R}^{n_{i+1} \times h_{i+1} \times w_{i+1}}$ by applying $n_{i+1}$ 3D filters $F_{i,j} \in \mathbb{R}^{n_i \times k \times k}$ on the $n_i$ channels, and this output is used as the input feature map for the next convolutional layer. Each of these filters consists of $n_i$ 2D convolution kernels $K \in \mathbb{R}^{k \times k}$, and all filters together form the matrix $F_i \in \mathbb{R}^{n_i \times n_{i+1} \times k \times k}$. As shown in Figure 6, the computation of the current convolutional layer is $n_{i+1} n_i k^2 h_{i+1} w_{i+1}$. When the filter $F_{i,j}$ is pruned, its corresponding feature map $x_{i+1,j}$ will be deleted, which saves $n_i k^2 h_{i+1} w_{i+1}$ operations. At the same time, the kernels that act on $x_{i+1,j}$ in the next layer will also be pruned, which additionally saves $n_{i+1} k^2 h_{i+2} w_{i+2}$ operations. That is, when $m$ filters in layer $i$ are pruned, the computational cost of layers $i$ and $i+1$ is reduced at the same time, which finally achieves the purpose of compressing the deep convolutional network.
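The filter and operation bookkeeping above can be sketched in code. This is a minimal illustration, assuming PyTorch; the L1-norm ranking shown here is one common filter-importance criterion and stands in for whichever criterion the pruning pipeline actually uses, while `ops_saved_by_one_filter` simply restates the operation counts from this section.

```python
import torch

def filter_l1_scores(conv: torch.nn.Conv2d) -> torch.Tensor:
    # conv.weight has shape (n_out, n_in, k, k); one L1 score per filter F_{i,j}.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def ops_saved_by_one_filter(n_i, n_i1, k, h_i1, w_i1, h_i2, w_i2):
    # Layer i loses the pruned filter's own cost: n_i * k^2 * h_{i+1} * w_{i+1};
    # layer i+1 loses the kernels acting on the deleted feature map:
    # n_{i+1} * k^2 * h_{i+2} * w_{i+2} (the counts given in Section 3.3.1).
    return n_i * k * k * h_i1 * w_i1 + n_i1 * k * k * h_i2 * w_i2
```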
Figure 6. Channel pruning principle. $x_i$ denotes the $i$-th convolutional layer of the network; $h_i$, $w_i$ denote the height and width of the input feature map, respectively; $n_i$ is the number of input channels of this convolutional layer.

3.3.2. Layer Pruning
The process of layer pruning for the target detection network is similar to that of channel pruning. Layer pruning cannot depend only on the principle that the number of convolutional kernels in adjacent convolutional layers is preserved equally, because if one of the convolutional layers at either end of a shortcut structure in YOLOv3 is cut off, it not only affects itself, but also has a serious impact on the whole shortcut structure, which can cause the network to fail to perform operations. Consequently, when a shortcut structure is determined to be unimportant, the entire shortcut structure is cut off. Due to the small amount of computation in the shallow network, instead of pruning the first shortcut structure, we start the layer pruning operation from the second shortcut structure. To begin with, the model is trained sparsely, and the stored BN layer $\gamma$ parameters are traversed, starting from the second shortcut structure with a $3 \times 3$ convolution kernel. The L1 norms of the $\gamma$ parameters of each BN layer are compared, and the BN layers following the second ($3 \times 3$) convolutional kernel in these 22 shortcut structures are ranked. The score of the second BN layer for the $i$-th shortcut structure can be expressed as

$$L_i = \sum_{s=1}^{N_c} |\gamma_s|$$

where $N_c$ denotes the number of channels, and $L_i$ is the $L_1$ norm of the second BN layer $\gamma$ parameters of the $i$-th shortcut, which indicates the importance of the shortcut structure. Next, the $n$ shortcut structures with the smallest $L_i$ values ($n$ is the number of shortcut structures to be removed) are cut out accordingly. Each shortcut structure includes two convolutional layers (a $1 \times 1$ convolutional layer and a $3 \times 3$ convolutional layer), so the corresponding $2n$ convolutional layers are clipped. Figure 7 depicts the layers before and after pruning.
Figure 7. Before and after layer pruning.
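The ranking step in Section 3.3.2 reduces to a few lines. A minimal sketch, assuming PyTorch and a hypothetical `bn_layers` list that maps each shortcut structure to the BN layer after its second (3 × 3) convolution in a sparsely trained model:

```python
import torch

def rank_shortcuts(bn_layers, n):
    # L_i = sum over the N_c channels of |gamma_s| for each shortcut's second BN layer.
    scores = torch.stack([bn.weight.detach().abs().sum() for bn in bn_layers])
    order = torch.argsort(scores)      # ascending: least important shortcuts first
    return order[:n].tolist()          # indices of the n shortcuts to cut
```

Each shortcut selected this way drops its 1 × 1 and 3 × 3 convolutional layers, giving the 2n clipped layers described above.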
3.3.3. Combination of Layer Pruning and Channel Pruning
Channel pruning significantly reduces the model parameters, computation and resource consumption, and layer pruning can further reduce the computation effort and improve the model inference speed. By combining layer pruning and channel pruning, both the depth and the width of the model can be compressed; in a sense, small-model exploration for different datasets is achieved. Therefore, this paper proposes to perform channel pruning and layer pruning together on the convolutional layers, finding more compact and efficient convolutional layer channel configurations to reduce the training parameters and speed up training. For this purpose, both channel pruning and layer pruning were applied to YOLOv3-K-means to obtain the new model. The model pruning process is shown in Figure 8. The optimal pruning ratio is found by increasing the ratio in 0.05 steps until the model mAP reaches the threshold value of effective compression, at which point the pruning ends. We finally chose a cut-channel ratio of 0.5. In addition to that, we also cut out 12 shortcuts, that is, 36 layers. The end of pruning does not mean the end of the model compression work: the weights of the clipped parts of the sparse model are set to 0, so if the pruned model were used directly for detection tasks, its accuracy would be greatly affected. Consequently, the model needs to be further trained on the training set to recover the model performance, a process called iterative tuning.
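The ratio search in Figure 8 can be sketched as a simple loop. This is an illustration under stated assumptions: `prune_and_finetune` and `evaluate_map` are hypothetical stand-ins for the real pruning and evaluation pipeline, and the stopping rule paraphrases the "threshold value of effective compression" above.

```python
def search_prune_ratio(model, prune_and_finetune, evaluate_map,
                       map_threshold, step=0.05, max_ratio=0.95):
    # Raise the pruning ratio in 0.05 steps; keep the last ratio whose
    # fine-tuned model still meets the mAP threshold.
    best_ratio, best_model = 0.0, model
    ratio = step
    while ratio <= max_ratio:
        candidate = prune_and_finetune(model, ratio)  # prune + iterative tuning
        if evaluate_map(candidate) < map_threshold:
            break
        best_ratio, best_model = ratio, candidate
        ratio += step
    return best_ratio, best_model  # this paper settled on a channel ratio of 0.5
```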
Figure 8. Flowchart of network slimming procedure.

3.4. Experimental Evaluation Index
To evaluate the performance of the model, the model was compared with several common deep learning-based object detection models in terms of accuracy, size of training weights, and inference speed. For an object detection model, precision and recall are two relatively contradictory metrics: as the recall rate increases, the precision decreases, and vice versa. The two metrics need to be traded off for different situations, and the trade-off is adjusted through the model's detection confidence threshold. In order to evaluate the two metrics together, the F1-score metric was introduced; it is calculated as in Equation (3). In addition to this, the average precision over all categories, mAP, is used to evaluate the accuracy of the model, as shown in Equation (4).
$$P = \frac{TP}{TP + FP} \tag{1}$$

$$R = \frac{TP}{TP + FN} \tag{2}$$

$$F1\text{-}score = \frac{2 \cdot P \cdot R}{P + R} \tag{3}$$

$$mAP = \frac{1}{M} \sum_{k=1}^{M} AP(k) \tag{4}$$
where TP is the number of true positives, FP is the number of false positives, P is the
precision, R is recall, and F1-score is a measure of test accuracy, which combines precision P
and recall R to calculate the score. AP(k) refers to the average precision of the kth category,
whose value is the area under the P(precision)–R(recall) curve; M represents the total
number of categories, and mAP is the average precision over all categories.
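Equations (1)–(4) translate directly into code. A minimal sketch; it assumes the TP/FP/FN counts and the per-class AP values (areas under the precision–recall curves) have already been computed by the detection-matching stage:

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)            # Equation (1)
    r = tp / (tp + fn)            # Equation (2)
    f1 = 2 * p * r / (p + r)      # Equation (3)
    return p, r, f1

def mean_average_precision(ap_per_class):
    # Equation (4): average AP(k) over the M categories.
    return float(np.mean(ap_per_class))
```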
4. Experimental Results
4.1. Result Analysis
4.1.1. Experiment and Analysis
Model training and testing were performed using the Windows operating system.
The test framework is PyTorch 1.3.1 (Facebook AI Research, New York, NY, USA) (CPU
is Intel(R) Xeon(R) Silver 4214R, 128 GB RAM, NVIDIA GeForce GTX 3090 Ti) and the
parallel computing framework is CUDA (10.1, NVIDIA, Santa Clara, CA, USA). The
hyperparameters of the model were set to 16 samples per batch with an initial learning rate
of 0.001. The model saves weights every 10 iterations for a total of 300 iterations. As the
training proceeds, the loss function continues to decrease. By the time the iteration reaches
300, the loss function is no longer decreasing.
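For reference, the training setup stated above, gathered into a single configuration sketch (illustrative names, not the authors' actual configuration file):

```python
train_cfg = {
    "framework": "PyTorch 1.3.1",
    "cuda": "10.1",
    "gpu": "NVIDIA GeForce GTX 3090 Ti",
    "batch_size": 16,        # samples per batch
    "initial_lr": 0.001,     # initial learning rate
    "max_iterations": 300,   # loss plateaus by iteration 300
    "checkpoint_every": 10,  # weights saved every 10 iterations
}
```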
In this paper, some experiments are conducted to compare the improved models
with different backbone networks of Faster-RCNN, SSD, YOLOv3 and YOLOv4. The
performance of each model was validated on the same dataset. The model generated after the 300 training iterations were completed was retained, together with its corresponding configuration file. The mAP, precision, recall, speed and model parameters of the above network models were then tested with the same test code and the same parameters, and the test results were recorded. Figure 9 represents
the changes of each evaluation criterion of the proposed models during the training process.
It can be seen that the accuracy and loss values of the model gradually level off as the
number of iterations increases. Figure 10 shows the recognition effect of the improved
model for sheep faces.
4.1.2. Comparison of Different Networks
In order to select a more suitable model for improvement, models based on different
backbone networks were compared. Table 1 shows the detection results of different models
on the same sheep face dataset. Compared with the CSP Darknet53-based YOLOv4 model,
the Darknet53-based YOLOv3 model was 4.15% and 1.20% higher in mAP and F1-score,
respectively. Compared with Faster-RCNN, a two-stage detection model based on the
Resnet network, in mAP and F1-score were 5.10% and 4.70% higher, respectively. YOLOv3
as a single-stage detection algorithm also outperforms the two-stage detection algorithm
Faster-RCNN in terms of speed. Lastly, YOLOv3 is compared with the SSD model based
on the VGG19 network. Both the mAP and F1-score of YOLOv3 are slightly lower than
those of the SSD model. However, there are many more candidate frames in the SSD
model than YOLOv3, resulting in a much slower detection speed than YOLOv3. After
comparing all aspects, we finally chose the YOLOv3 model for optimization. In addition,
we performed K-means clustering of anchor frames in YOLOv3 based on the sheep face
dataset. Compared with the initial model, the mAP improved by 1.1%.
Figure 9. Changes in evaluation criteria during the training process.
Figure 10. Recognition result.
Table 1. Comparison of different network detection results.

Model           mAP       Precision   Recall    F1-Score   Parameters
Faster R-CNN    90.20%    80.63%      90%       84.00%     108 MB
SSD             98.73%    96.85%      96.25%    96.35%     100 MB
YOLOv3          95.30%    82.90%      95.70%    88.70%     235 MB
YOLOv4          91.15%    88.70%      88.00%    87.50%     244 MB

4.1.3. Comparative Analysis of Different Pruning Strategies
In order to evaluate the performance of the proposed model, the YOLOv3 model based on the clustering of the sheep face dataset is pruned to different degrees, including channel pruning, layer pruning and simultaneous pruning of channels and layers. The results of the final fine-tuning of each model are compared. Table 2 shows the detection results with different pruning levels. Compared with the model after clustering based on the sheep face dataset, the pruned model has a slight improvement in both mAP and detection speed. The model size has also been reduced from 235 MB to 61 MB, making it easier to deploy. Compared with the original YOLOv3 model, the proposed model based on YOLOv3 with sheep face dataset clustering, pruned layers and pruned channels
Animals 2022, 12, 1465
reduces the number of convolutional layers and the number of convolutional layers and
the number of network channels by a certain amount. The mAP, accuracy, recall and F1score all improved by 1.90%, 7%, 1.80% and 5.10%., respectively. The detection speed was
also improved from 9.7 ms to 8.7 ms, while the number of parameters has been significantly reduced. To make the results more convincing, we mixed the training and test12sets
of 15
and trained the model 10 times using a 10-fold cross-validation method. The results are
shown in Figure 11. The histogram represents the mean value of 10 experiments, and the
error barsreduced.
represent
the maximum
of 10
mAP
To make
the results and
moreminimum
convincing,values
we mixed
theexperiments
training and with
test sets
and
value of 96.84%.
trained the model 10 times using a 10-fold cross-validation method. The results are shown
in Figure 11. The histogram represents the mean value of 10 experiments, and the error
Table 2. Comparison
of the
of different
levels of pruning.
bars represent
theresults
maximum
and minimum
values of 10 experiments with mAP value
of 96.84%.
Table 2. Comparison of the results of different levels of pruning.

Model               | mAP    | Precision | Recall | F1-Score | Parameters | Speed
------------------- | ------ | --------- | ------ | -------- | ---------- | ------
Prune_channel       | 96.80% | 89.50%    | 97.00% | 92.80%   | 69.9 MB    | 8.9 ms
Prune_layer         | 95.70% | 89.50%    | 95.70% | 91.90%   | 132 MB     | 9.2 ms
Prune_channel_layer | 97.20% | 89.90%    | 97.50% | 93.30%   | 61.5 MB    | 8.7 ms
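The section above does not restate the channel-selection criterion, but a common choice in YOLOv3 pruning pipelines is to rank channels by the absolute BatchNorm scale factor |gamma|, in the spirit of network slimming. The sketch below illustrates that idea under this assumption; it is not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): rank channels by |gamma| of the
# BatchNorm layers and keep those above a global quantile threshold,
# network-slimming style.
import torch
import torch.nn as nn

def select_prunable_channels(model, prune_ratio=0.5):
    """Global |gamma| threshold plus a per-BN-layer keep mask."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    return threshold, {name: m.weight.data.abs() > threshold
                       for name, m in model.named_modules()
                       if isinstance(m, nn.BatchNorm2d)}

# Toy stand-in for a YOLOv3 backbone block; gammas are randomized to mimic a
# trained, sparsity-regularized network (freshly built BN layers start at 1).
toy = nn.Sequential(nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32), nn.LeakyReLU(0.1),
                    nn.Conv2d(32, 64, 3), nn.BatchNorm2d(64), nn.LeakyReLU(0.1))
for m in toy.modules():
    if isinstance(m, nn.BatchNorm2d):
        nn.init.uniform_(m.weight, 0.0, 1.0)
thr, masks = select_prunable_channels(toy)
for name, mask in masks.items():
    print(f"layer {name}: keep {int(mask.sum())}/{mask.numel()} channels")
```

In a full pipeline, the network would then be rebuilt with only the kept channels (and, for layer pruning, with low-importance blocks removed) and fine-tuned to recover accuracy.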
Figure 11. 10-fold cross-validation results of our model.

4.2. Discussion

It is worth noting that the sheep face data collected using cameras are relatively large compared to data collected by other sensors. If the data must be transmitted over the Internet, this may be a challenge, especially for ranches in areas with poor Internet connectivity. For this reason, our vision is to perform sheep face recognition using edge computing, thus improving response time and saving bandwidth.
Only the processed results, rather than the original video, are transmitted to a centralized server via cable, a local area network, or a 4G network; such infrastructure is already available on modern farms for monitoring purposes. The extracted information will be stored and correlated on the server. In addition, we can apply sheep face recognition to a sheep weight prediction system to further realize non-contact weight prediction, thus avoiding stress reactions that could affect the sheep's growth and development.
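As a concrete illustration of transmitting processed results instead of raw video, an edge device might upload a compact JSON record per recognized face. The endpoint URL and record fields below are illustrative assumptions, not part of the paper.

```python
# Illustrative only: an edge device posts a small JSON record per recognized
# sheep face instead of streaming raw video to the server.
import json, time, urllib.request

def report_detection(sheep_id, confidence,
                     server="http://farm-server.local/api/detections"):
    record = {"sheep_id": sheep_id, "confidence": confidence,
              "timestamp": time.time()}                      # a few hundred bytes
    req = urllib.request.Request(server, data=json.dumps(record).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# e.g., report_detection("sheep_042", 0.97) after each recognized face
```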
In this paper, we present a method for identifying sheep based on facial information using a deep learning target detection model. To ensure that the method is accepted in practical applications, our short-term plans focus on (a) significantly increasing the number of sheep tested to verify the reliability of sheep face recognition; (b) developing a specialized sheep face data collection system with edge computing capabilities; (c) optimizing shooting methods to avoid blurred or obscured images as much as possible; and (d) further compressing the model by attempting parameter quantization, network distillation, network decomposition, and compact network design
to obtain a lightweight model. Moreover, maintaining accuracy while compressing is also
an issue to be considered. In the long term, we will develop more individual sheep monitoring functions driven by target detection algorithms and integrate them with edge computing.
5. Conclusions
In this paper, we propose a deep learning-based YOLOv3 network for contactless
identification of sheep in pastures. This study further extends the idea of individual sheep
identification through image analysis and considers animal welfare issues. Our proposed
method uses YOLOv3 as the base network and modifies the anchor boxes in the YOLOv3
network model. The IoU function is used as the distance metric of the K-means algorithm.
The K-means clustering algorithm is used to cluster the sheep face dataset to generate new
anchors, replacing the original anchors in YOLOv3. This makes the network better suited to the sheep face dataset and improves the model's accuracy in recognizing sheep faces. To make the proposed network model easy to deploy, the clustered model was compressed by performing channel pruning and layer pruning operations. The test results show that, under normal conditions, the F1-score and
mAP of the model reached 93.30% and 97.20%, respectively, which clearly demonstrates the effectiveness of our strategy on the test dataset. The method facilitates the daily management of the farm and provides solid technical support for smart sheep farms to achieve
welfare farming. The proposed model can infer 416 × 416 pixel images in 8.7 ms on a GPU, which gives it the potential to be applied to embedded devices. However, there
are still some shortcomings in our method, such as the small sample size of the experiment.
In the future, we will test the ability to identify large numbers of sheep by applying our model on a real ranch, and we will consider applying sheep face recognition to a sheep weight prediction system.
Author Contributions: Conceptualization, S.S. and T.L.; methodology, S.S.; software, S.S.; validation,
H.W., B.H. and H.S.; formal analysis, S.S.; investigation, S.S., C.Y. and F.G.; resources, H.S.; data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, S.S.; visualization,
S.S.; supervision, T.L.; project administration, T.L.; funding acquisition, T.L. All authors have read
and agreed to the published version of the manuscript.
Funding: This research was funded by the Key Project Supported by Science and Technology of the
Tianjin Key Research and Development Plan (grant number 20YFZCSN00220), Central Government
Guides Local Science and Technology Development Project (grant number 21ZYCGSN00590), and
Inner Mongolia Autonomous Region Department of Science and Technology Project (grant number
2020GG0068).
Institutional Review Board Statement: The study was conducted according to the guidelines of the
Declaration of Helsinki, and approved by the Institutional Ethics Committee of Tianjin Agricultural
University (2021LLSC026, 30 April 2021).
Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset can be obtained by contacting the corresponding author.
Acknowledgments: The authors also express their gratitude to the staff of the ranch for their help in
the organization of the experiment and data collection.
Conflicts of Interest: The authors declare no conflict of interest.