Original Article
A deep learning semantic
segmentation network with attention
mechanism for concrete crack
detection
Structural Health Monitoring
2023, Vol. 22(5) 3006–3026
© The Author(s) 2023
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/14759217221126170
journals.sagepub.com/home/shm
Jiaqi Hang1, Yingjie Wu1, Yancheng Li2, Tao Lai1, Jinge Zhang1 and Yang Li3
Abstract
In this research, an attention-based feature fusion network (AFFNet), with a residual network (ResNet101) backbone enhanced by two attention mechanism modules, is proposed for automatic pixel-level detection of concrete cracks. In particular, the two attention mechanism modules, namely the vertical and horizontal compression attention module (VH-CAM) and the efficient channel attention upsample module (ECAUM), are included to enable selective concentration on crack features. The VH-CAM generates a feature map integrating pixel-level information in the vertical and horizontal directions. The ECAUM, applied on each decoder layer, combines efficient channel attention (ECA) and feature fusion, providing rich contextual information as guidance to help low-level features recover crack localization. The proposed model is evaluated on the test dataset and reaches 84.49% mean intersection over union (MIoU). Comparison with other state-of-the-art models proves the high efficiency and accuracy of the proposed method.
Keywords
Semantic segmentation, attention mechanism, crack detection, deep learning
Introduction
Due to the low tensile strength of concrete,1 cracks will inevitably appear in concrete structures under the influence of external load and temperature change. The existence of cracks accelerates the corrosion of rebar, seriously affecting the load-carrying capacity and durability of the structure.2 Because cracks are an important indicator for evaluating structural damage and durability,3 crack detection is of considerable importance in concrete structure maintenance. The traditional crack detection method is periodical manual inspection, which generally involves sending inspectors to measure cracks with bulky equipment.4 However, the results are susceptible to subjective factors, and the process is also time-consuming and labor-intensive.5 For these reasons, a new crack detection approach with high accuracy and efficiency is desirable, and such development is of great interest to all stakeholders.
To overcome the drawbacks of these human-based methods, many image-processing-technique (IPT)-based crack detection methods have been proposed, such as image thresholding,6,7 edge detection,8,9 and morphological operations.10,11 However, the prediction results rely highly on manually defined parameters and are easily affected by complex environments in real-world situations.
Recently, deep learning (DL) has been rapidly developed12 to address the issues mentioned above. Compared with IPTs, DL can automatically extract abundant abstract features from massive surface crack data.13 Therefore, researchers have adopted DL algorithms to improve the accuracy and efficiency of crack detection.14,15
1College of Civil Engineering, Nanjing Tech University, Nanjing, Jiangsu, China
2School of Civil and Environmental Engineering, University of Technology Sydney, Ultimo, NSW, Australia
3Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong, China
Corresponding author:
Yancheng Li, School of Civil and Environmental Engineering, University of
Technology Sydney, 15 Broadway, Ultimo, NSW 2007, Australia.
Email: yancheng.li@uts.edu.au
Due to the outstanding ability to achieve pixel-level prediction of cracks, semantic segmentation approaches, which classify each pixel as crack or non-crack,16 have most recently obtained increasing attention. A semantic segmentation network usually adopts an encoder-decoder structure, whose encoder backbone is derived from state-of-the-art image classification architectures.17–19 Yang et al.20 used a fully convolutional network (FCN) with a VGG19 backbone to detect concrete cracks, where the training time was lower than that of CrackNet21 on account of the end-to-end structure. Huyan et al.22 adopted two U-Nets with VGGNet and ResNet backbones to perform pavement crack segmentation from good-quality images without noise, which exhibited significant advantages compared to FCN. Li et al.23 proposed an FCN with a DenseNet121 backbone to analyze smartphone images for automatic detection of four damage types in concrete structures. On the other hand, to prevent information loss in the down-sampling operation, researchers tend to develop DL models without pooling layers. Zhang et al.24 proposed a model named CrackNet II, which removed all pooling layers, for automatic pavement crack detection using 3000 asphalt surface images. However, those semantic segmentation approaches face an obvious challenge: they cannot aggregate rich contextual information well. To address this, two well-known models have been proposed to aggregate such context. The first is DeepLabv3,25 which uses atrous spatial pyramid pooling (ASPP) to fuse feature maps at different scales to capture contextual information. The second is PSPNet,26 which employs a pyramid pooling module to aggregate multiscale context. Wang et al.27 used 2446 concrete and asphalt crack images to train and evaluate five semantic segmentation models, including FCN, GCN, PSPNet, UPerNet, and DeepLabv3+, and found that DeepLabv3+ shows the best performance. Ji et al.28 adopted DeepLabv3+ for automatic detection of pavement cracks, and crack images collected by UAV were tested, obtaining a mean intersection over union (MIoU) of 78.75%. However, the above methods can only collect local contextual information, while the global information in the crack image cannot be fully captured.
The attention mechanism, first proposed by Bahdanau et al.,29 is a cognitive process that enables selective concentration on nominated features while intentionally disregarding unimportant information. The attention mechanism can be applied to semantic segmentation models as a global context exploration method, as in SENet,30 CBAM,31 and DANet.32 In crack detection, there have been attempts to combine the attention mechanism with available networks to improve detection efficiency.33–35 For example, Pan et al.36 modified the backbone of DANet from ResNet101 to VGG19, namely SCHNet, and added a new attention mechanism named feature pyramid attention to improve crack detection accuracy. The results demonstrated that the three attention mechanisms can increase MIoU by 10.88% over the baseline model. The attention maps in SCHNet are obtained by calculating the similarities among all pixels in the feature map. However, these feature maps are obtained by 1 × 1 convolution, which means that each pixel is influenced by only one pixel in the input feature map. This leads to a troublesome situation: one pixel cannot contain much spatial information, which makes the attention map suboptimal. In addition, SCHNet does not consider the fusion of low-level and high-level features, which could help the decoder to generate high-resolution semantic features. Meanwhile, direct fusion will downgrade the performance of crack segmentation. To solve these problems, this research designs a novel model combining FCN and two attention mechanism modules to aggregate rich contextual information for automatic concrete crack detection.
Upon crack detection from the image, the evaluation of the crack, such as crack area and crack width, is a direct demand, which can assist practitioners in decision-making and further maintenance scheduling. Conversion from crack pixel information to physical dimensions is required to achieve the above objective. Li et al.37 proposed a crack image binarization architecture called SegNet-DCRF and further calculated the unidirectional crack width and web crack area. Bhowmick et al.38 proposed a U-Net architecture for crack image segmentation and used morphological operations from image processing to quantify the geometrical properties of concrete surface cracks. Built on the accurate identification of cracks from images, this research conducts morphological feature measurement and crack severity ranking to enable potential applications of the proposed algorithm.
In this research, an attention-based feature fusion network (AFFNet) is proposed for crack segmentation under various complex conditions. The attention mechanism is used to aggregate crack features and suppress irrelevant features to improve segmentation performance. To capture rich contextual information, the vertical and horizontal compression attention module (VH-CAM) is set on top of the ResNet101 backbone,39 using two asymmetric convolutions to enable a single pixel to contain more information. Meanwhile, the efficient channel attention upsample module (ECAUM) combines efficient channel attention (ECA) and feature fusion to restore semantic boundaries by guiding low-level features. In consequence, these two attention mechanism modules contribute to better feature representations and more precise crack segmentation results. In addition, the semantic segmentation images of cracks produced by AFFNet are used to quantitatively measure the morphological features of cracks using single-pixel-width skeletons.

Figure 1. Overall architecture of the proposed AFFNet for semantic segmentation. The red and green lines represent down-sampling and up-sampling operations. The proposed model uses ResNet101 as the backbone and applies VH-CAM and ECAUM to improve crack segmentation.
AFFNet: attention-based feature fusion network; ResNet101: residual network; VH-CAM: vertical and horizontal compression attention module; ECAUM: efficient channel attention upsample module.
The content of this paper is organized as follows. Section ''Methodology'' presents the detailed feature fusion network. Section ''Implementation details'' is devoted to the implementation details, including the process of generating the dataset and the training parameter settings. Section ''Experimental results'' introduces the experimental results and the corresponding analysis. Section ''Discussion'' discusses the different test sets. Section ''Conclusion'' summarizes the conclusions of this paper.
Methodology
To distinguish crack pixels from non-crack pixels, this research proposes a novel model named AFFNet, with a ResNet101 backbone pretrained on ImageNet and the integration of two attention mechanism modules. The backbone network, ResNet,39 is the first convolutional neural network with a depth of more than 100 layers; it solves the degradation problem in which accuracy tends to saturate and then decrease as the depth of the network increases. It won the ILSVRC and COCO 2015 competitions and has been widely used in semantic segmentation. In addition, dilated convolution with a rate of two is employed and the stride is modified from two to one in the last ResNet block, enlarging the output size of ResNet101 from 1/32 to 1/16 of the raw image. In this way, more details of the feature maps can be retained. The structure of the proposed AFFNet, which incorporates the two attention mechanism modules VH-CAM and ECAUM, is shown in Figure 1. The quadrangular prisms represent the blocks of ResNet101 and the arrows represent the operations of the model. The detailed network parameters are listed in Table 1. Notably, the ReLU and BN layers used in ResNet101 are not presented in Table 1. The keep probability of the dropout layer in Figure 1 is 0.9, which assigns the value of 0 to each channel with a probability of 0.1.
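As a concrete illustration, the backbone modification described above (stride 1 and dilation 2 in the last block, so the output is 1/16 of the input) can be reproduced with torchvision; this is a sketch under our assumptions, not the authors' released code.

```python
# A sketch of the dilated ResNet101 backbone: replace_stride_with_dilation
# dilates the last block (stride 2 -> 1, dilation 2), giving a 1/16 output.
import torch
import torchvision

backbone = torchvision.models.resnet101(
    pretrained=True,                                   # ImageNet weights
    replace_stride_with_dilation=[False, False, True]
)
features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc

x = torch.randn(1, 3, 224, 224)   # a 224 x 224 RGB crack image
print(features(x).shape)          # torch.Size([1, 2048, 14, 14]), i.e., 1/16
```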
Similar to human attention, the attention mechanism can concentrate on the features that need to be focused on to acquire more details, while ignoring irrelevant information.40 The essence of the attention mechanism is to learn a weight distribution over the feature maps. In recent years, the attention mechanism has developed rapidly in computer vision in light of these advantages. Here, VH-CAM and ECAUM are chosen, and their key features are explained in the following sections.
Vertical and horizontal compression attention
module
It is known that contextual information is of considerable importance in semantic segmentation due to the multiple scales of objects.41 However, local features from a traditional FCN may mislead the classification process at the pixel level.32
Table 1. The detailed configuration of each layer in AFFNet.

Layer name | Kernel size (width × height, channels) | Stride | Dilation | Output size (width × height) | Output channels
Input      | –                                       | –      | –        | 224 × 224 | 3
Conv1      | 7 × 7, 64                               | 2      | 1        | 112 × 112 | 64
Maxpool    | 3 × 3                                   | 2      | 1        | 56 × 56   | 64
Res-1      | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3  | 1      | 1        | 56 × 56   | 256
Res-2      | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | 2     | 1        | 28 × 28   | 512
Res-3      | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 23 | 2   | 1        | 14 × 14   | 1024
Res-4      | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | 1    | 2        | 14 × 14   | 2048
VH-CAM     | –                                       | –      | –        | 14 × 14   | 2048
ECAUM(1)   | –                                       | –      | –        | 28 × 28   | 1024
ECAUM(2)   | –                                       | –      | –        | 56 × 56   | 512
ECAUM(3)   | –                                       | –      | –        | 56 × 56   | 256
Dropout    | –                                       | –      | –        | 56 × 56   | 256
Conv2      | 1 × 1, 2                                | 1      | 1        | 56 × 56   | 2
Upsample   | –                                       | –      | –        | 224 × 224 | 2
Conv3      | 3 × 3, 2                                | 1      | 1        | 224 × 224 | 2
Softmax    | –                                       | –      | –        | 224 × 224 | 2

AFFNet: attention-based feature fusion network; VH-CAM: vertical and horizontal compression attention module; ECAUM: efficient channel attention upsample module.
To overcome this issue, the VH-CAM is introduced,41 which can capture rich contextual information to accomplish the crack segmentation task. Different from the position attention module in DANet, VH-CAM employs two asymmetric convolutions with kernel sizes of 1 × W and H × 1 to enable each pixel to contain more information. Then the attention map is obtained through matrix operations, which is more comprehensive than using a 1 × 1 convolution. Next, the process of aggregating contextual information between crack and background is introduced.
The exact working principle of VH-CAM is described in Figure 2. The feature map $A \in \mathbb{R}^{C \times H \times W}$, where H, W, and C represent the height, width, and channel number of the feature map, respectively, is first sent into two asymmetric convolution layers of 1 × W and H × 1 to generate two compressed feature maps B and D, shown in Figure 3. $B \in \mathbb{R}^{C' \times H \times 1}$ and $D \in \mathbb{R}^{C' \times 1 \times W}$ have a different number of channels (C′) than A. Then D is reshaped to $D' \in \mathbb{R}^{C' \times W}$. Meanwhile, after reshape and transpose operations, B is transformed to $B' \in \mathbb{R}^{H \times C'}$; the specific operation is shown in Figure 4. After that, a matrix multiplication between B′ and D′ is performed and the sigmoid function is utilized to generate the two-dimensional attention map $E \in \mathbb{R}^{H \times W}$:

$$e_{ij} = \frac{1}{1 + \exp(-B'_i \cdot D'_j)} \quad (1)$$

where $B'_i$ indicates the ith row vector of feature map B′, $D'_j$ indicates the jth column vector of feature map D′, and $e_{ij}$ indicates the weight at the ith row and jth column of the attention map E after the matrix multiplication.

Then, a multiplication operation is applied to E and A to generate a new feature map $F \in \mathbb{R}^{C \times H \times W}$ possessing rich context:

$$F_m(i, j) = A_m(i, j)\, e_{ij} \quad (2)$$

where $A_m(i, j)$ indicates the element at the ith row and jth column of the mth (m = 1, 2, …, C) channel in the original feature map A, and $F_m(i, j)$ has the same definition.

Finally, each element in feature map F is multiplied by the parameter α, and a point-wise addition is performed on the original feature map A and the multiplied result to generate the final output $G \in \mathbb{R}^{C \times H \times W}$:

$$G_m(i, j) = \alpha F_m(i, j) + A_m(i, j) \quad (3)$$

where α is a learnable weight initialized to 0.
Figure 2. Structure of the proposed vertical and horizontal compression attention module.
Figure 3. An example of the asymmetric convolution (refer to the process generating feature map B in Figure 2; the same principle applies to feature map D when the convolutional kernel is an H × 1 vector).
Figure 4. An example of the reshape and transpose operations.
It can be seen from Equation (2) that the attention map E assigns different weights to each element of the original feature map A. Therefore, the final output G in Equation (3) can aggregate long-range contexts according to the spatial attention map.
Efficient channel attention upsample module
In general, low-level features and high-level features are equally important.42 To restore the details lost in consecutive down-sampling, many models adopt encoder-decoder structures, such as FCN,43 U-Net,44 and SegNet.45
Figure 5. Structure of the proposed efficient channel attention upsample module.
However, these encoder-decoder structures lack appropriate guidance and may cause misclassification.41 To overcome this problem, ECAUM, which combines the attention mechanism with feature fusion, is adopted here. Due to the few parameters involved and its high performance, the ECA46 in ECAUM provides high-level semantic information as guidance to help low-level features select precise resolution details. Moreover, since residual blocks dominate the performance of ResNet101, three ECAUMs are used to perform feature fusion with the residual blocks and the decoder.
The structure of ECAUM is illustrated in Figure 5 and the mathematical evolution of ECA is described in Figure 6. First, the high-level feature map $B \in \mathbb{R}^{C' \times H' \times W'}$ is put through global average pooling (see the illustration in Figure 6) to obtain rough global contextual information without dimensionality reduction:

$$Y_m = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} B_m(i, j) \quad (4)$$

Here, $B_m(i, j)$ indicates the element at the ith row and jth column of the mth (m = 1, 2, …, C′) channel in the high-level feature map B, and $Y_m$ indicates the average pixel value of the mth channel in the feature vector Y.
Then, the channel weight vector Y′ is obtained using a 1D convolution of kernel size k. To make the channel number of vector Y′ equal to that of the low-level feature map A, a 1 × 1 convolution is used. After that, the sigmoid function is adopted to obtain the vector Y″, limiting the range of the weight vector to [0, 1]:

$$Y'' = \sigma(w_2(w_1(Y))) \quad (5)$$
where $w_1$ indicates the 1D convolution, $w_2$ indicates the 1 × 1 convolution, and σ indicates the sigmoid function.

Then, Y″ is utilized as guidance for the low-level feature map A with the following operation:

$$E_m = w_3(Y''_m A_m + A_m) \quad (6)$$

where $w_3$ indicates a 1 × 1 convolution and the feature map $E \in \mathbb{R}^{C \times H \times W}$ has the same size as A. $A_m$ indicates the mth channel of A, and the definitions of $E_m$ and $Y''_m$ are analogous.
The fusion of low-level and high-level features is an effective approach to restore the details lost in consecutive down-sampling. Transposed convolution is utilized as an efficient upsampling method to enlarge the high-level feature map $B \in \mathbb{R}^{C' \times H' \times W'}$, and then a 1 × 1 convolution is adopted to obtain the new feature map $D \in \mathbb{R}^{C \times H \times W}$, which has the same size as feature map A:

$$D = w_4(d(B)) \quad (7)$$
Figure 6. The details of efficient channel attention.
where $d$ indicates the transposed convolution and $w_4$ indicates the 1 × 1 convolution operation.

Finally, E and D are concatenated, and a 3 × 3 convolution followed by BN and ReLU is adopted to obtain the final output $F \in \mathbb{R}^{C \times H \times W}$. Notably, the last ECAUM module uses two 3 × 3 convolutions to increase the depth of the model:

$$F = w_5(\varphi(D, E)) \quad (8)$$

where $\varphi$ indicates the concatenation operation and $w_5$ indicates the 3 × 3 convolution. The output F is used as the high-level feature map for the next module.
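The ECAUM computation of Equations (4) to (8) can be sketched in PyTorch as follows; the kernel size of the transposed convolution and the channel choices are assumptions where the paper does not state them, and the single 3 × 3 output convolution corresponds to Equation (8) (the last ECAUM would use two).

```python
# A sketch of ECAUM (Equations (4)-(8)); hyper-parameters are assumed.
import torch
import torch.nn as nn

class ECAUM(nn.Module):
    def __init__(self, high_ch, low_ch, k=3):
        super().__init__()
        self.w1 = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # ECA 1D conv
        self.w2 = nn.Conv2d(high_ch, low_ch, 1)          # match channel number of A
        self.w3 = nn.Conv2d(low_ch, low_ch, 1)           # Equation (6)
        self.up = nn.ConvTranspose2d(high_ch, high_ch, 2, stride=2)  # transposed-conv upsample
        self.w4 = nn.Conv2d(high_ch, low_ch, 1)          # Equation (7)
        self.w5 = nn.Sequential(                         # Equation (8): 3x3 conv + BN + ReLU
            nn.Conv2d(2 * low_ch, low_ch, 3, padding=1),
            nn.BatchNorm2d(low_ch), nn.ReLU(inplace=True))

    def forward(self, b, a):                             # b: high-level, a: low-level
        y = b.mean(dim=(2, 3))                           # Equation (4): global average pooling
        y = self.w1(y.unsqueeze(1)).squeeze(1)           # 1D conv across channels
        y = torch.sigmoid(self.w2(y[:, :, None, None]))  # Equation (5): Y''
        e = self.w3(y * a + a)                           # Equation (6): guide low-level features
        d = self.w4(self.up(b))                          # Equation (7): upsample + 1x1 conv
        return self.w5(torch.cat([d, e], dim=1))         # Equation (8)

m = ECAUM(high_ch=2048, low_ch=1024)
f = m(torch.randn(1, 2048, 14, 14), torch.randn(1, 1024, 28, 28))  # -> (1, 1024, 28, 28)
```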
Implementation details
Dataset construction
To verify the effectiveness and robustness of AFFNet, a concrete crack image dataset is constructed for the following experimental validations. To ensure variability, the images in the dataset contain not only wall cracks but also pavement and bridge cracks, saved in JPG format. Furthermore, some of the crack images also contain various types of noise often observed on concrete structures, such as spots, shadows, water stains, handwriting, Gaussian noise, and insufficient lighting.
The dataset contains 1760 crack images, of which 776 are from the paper by Yang et al.,20 524 are collected manually, and 460 are generated using data augmentation techniques. The manually collected images are taken by a 40-megapixel smartphone at different distances without zoom, where the aperture is f/1.8, the ISO is 50, and the original full image resolution is 2736 × 3648 pixels. To decrease the computational cost of training the model, the original images are cropped into sub-images with a size of 224 × 224 pixels. In order to detect cracks in more complex environments, data augmentation techniques such as rotation and Gaussian noise are used to increase the complexity of the dataset. The proposed AFFNet generates the crack shape and location by segmenting crack images to obtain important crack features. Therefore, the images obtained by the cropping operation are labeled as ground truths using Photoshop software. These ground truths are then converted to single-channel PNG format, where crack pixels and background pixels are labeled as 255 and 0, respectively. In order to assess the generalization ability of the proposed AFFNet, the 1760 images in the dataset are randomly divided into three parts: 64% for training, 16% for validation, and the remaining 20% for testing. Specifically, eight types of cracks are included in the dataset, covering cracks without noise and cracks with noise.
For the former, there are four subgroups: (1) diagonal crack, containing only one crack in the diagonal direction; (2) transverse crack, containing only a single transverse crack; (3) reticulation crack, containing more than one crack; and (4) wide crack, filled with stones and earth. For the latter, there are six groups: (1) crack with spalls on the concrete surface; (2) crack with shadow, where the shadow interferes with crack detection; (3) crack with water stain, with water stains around the crack; (4) crack with handwriting, containing black handwriting similar to a crack; (5) crack with Gaussian noise; and (6) crack in insufficient lighting.
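A sketch of the ground-truth conversion and the 64/16/20 split described above is given below; the directory layout and file naming are illustrative assumptions, not part of the published dataset.

```python
# A sketch of label conversion and the random 64/16/20 split; paths are assumed.
import random
from pathlib import Path
import numpy as np
from PIL import Image

def mask_to_label(png_path):
    """Single-channel ground truth: 255 = crack, 0 = background -> {1, 0}."""
    mask = np.array(Image.open(png_path).convert("L"))
    return (mask == 255).astype(np.int64)

images = sorted(Path("dataset/images").glob("*.jpg"))  # 224 x 224 sub-images
random.seed(0)
random.shuffle(images)
n = len(images)                                        # 1760 in the paper
train = images[: int(0.64 * n)]                        # 64% training
val = images[int(0.64 * n): int(0.80 * n)]             # 16% validation
test = images[int(0.80 * n):]                          # 20% test
```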
Model initialization

Model initialization determines whether the model converges.47 When training AFFNet, transfer learning is adopted instead of training from scratch to improve the training efficiency and crack segmentation performance. In consequence, the initialization of all convolutional layers is the same as that of the pretrained ResNet101, where weights are initialized with the Kaiming method48 and biases are set to 0 and not trained.

Moreover, AFFNet uses the transposed convolution method to enlarge the high-level feature maps. Compared to other upsampling methods, transposed convolution is learnable and can be trained through the network to obtain a better upsampling result.
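For the layers not covered by the pretrained weights, the initialization described above might be written as follows; the small decoder stand-in is illustrative, not the paper's exact module list.

```python
# A sketch of Kaiming initialization for newly added (non-pretrained) layers.
import torch.nn as nn

decoder = nn.Sequential(nn.ConvTranspose2d(2048, 1024, 2, stride=2),
                        nn.Conv2d(1024, 2, 1))          # illustrative stand-in

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)                      # biases set to 0

decoder.apply(init_weights)  # pretrained ResNet101 weights are loaded separately
```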
Loss function

The loss function estimates the discrepancy between the predicted result and the ground truth.20 The optimal solution of the model minimizes the value of the loss function by fine-tuning parameters during training; therefore, the selection of an appropriate loss function is indispensable for AFFNet. Since crack segmentation can be regarded as pixel-level classification, the cross-entropy loss function is applied to the proposed AFFNet on account of its effectiveness and solid theoretical grounding. The loss for each pixel can be represented as:

$$L = -[y \ln(p) + (1 - y) \ln(1 - p)] \quad (9)$$

where $y$ and $p$ indicate the ground truth value and the predicted value, respectively. The total loss for each concrete crack image is the mean of the losses over all pixels.

Optimizer

The optimizer is one of the crucial components of DL because of its ability to minimize the value of the loss function and update the model parameters. Due to its fast updating speed and simple setting, stochastic gradient descent with momentum (SGDM) is employed to train AFFNet.12 The weight decay, an important parameter in the optimizer, is set to 0.0001. In addition, the batch size is set to eight when training AFFNet. The expressions for updating parameters using SGDM are:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta) \quad (10)$$

$$\theta = \theta - v_t \quad (11)$$

where $\eta$ represents the learning rate, $\nabla_\theta J(\theta)$ represents the gradient of the loss function $J(\theta)$, and $\gamma$ represents the momentum, with a value of 0.9.

The learning rate controls the updating speed of the model parameters during training. A small learning rate reduces the updating speed, while an over-large learning rate can result in parameters hovering around the optimal value. Therefore, a learning rate decay method using an exponential decay function is adopted in this paper:

$$lr_t = lr_0 \cdot r^t \quad (12)$$

where $r = 0.95$ is the drop factor and $t$ is the drop period; the learning rate is updated once each epoch.

Evaluation metrics

The performance of AFFNet in crack detection needs to be evaluated by standard and well-known metrics.48 In this paper, pixel accuracy (PA), mean pixel accuracy (MPA), MIoU, and frequency weighted intersection over union (FWIoU) are used. We first introduce the symbols in the formulas: for a segmentation task on a dataset containing $k + 1$ classes, $p_{ij}$ represents the number of pixels originally belonging to class $i$ but classified into class $j$; this definition applies equally to $p_{ii}$, $p_{jj}$, and $p_{ji}$. The first two evaluation metrics are:

$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \quad (13)$$

$$MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}} \quad (14)$$
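Taken together, the loss, optimizer, and learning-rate schedule above can be configured in PyTorch as follows; this is a sketch in which the stand-in model, the random batch, and the initial learning rate of 5 × 10⁻² (the value selected in the ''Experimental results'' section) are our assumptions.

```python
# A sketch of the training configuration: cross-entropy loss, SGDM
# (momentum 0.9, weight decay 1e-4), exponential LR decay with r = 0.95.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, 1)                      # stand-in for AFFNet (2 classes)
criterion = nn.CrossEntropyLoss()               # Equation (9), averaged over pixels
optimizer = torch.optim.SGD(model.parameters(), lr=5e-2,
                            momentum=0.9, weight_decay=1e-4)  # Equations (10)-(11)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(2):                          # 100 epochs in the paper
    images = torch.randn(8, 3, 224, 224)        # batch size 8
    labels = torch.randint(0, 2, (8, 224, 224)) # 0 = background, 1 = crack
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                            # lr_t = lr_0 * 0.95^t, Equation (12)
```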
Figure 7. Training and validation loss curves under three initial learning rates during 100 epochs.
Figure 8. MIoU curves of three initial learning rates on training set and validation set during 100 epochs.
MIoU: mean intersection over union.
$$MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (15)$$

$$FWIoU = \frac{1}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (16)$$

Among all the above metrics, MIoU stands out for evaluating segmentation models because of its representativeness and simplicity.
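Equations (13) to (16) can be computed from a confusion matrix; the following sketch, with purely illustrative pixel counts, shows one way to do so (for the binary crack task, k + 1 = 2).

```python
# A sketch of PA, MPA, MIoU, and FWIoU from a (k+1) x (k+1) confusion
# matrix whose entry [i, j] counts pixels of class i predicted as class j.
import numpy as np

def metrics(conf):
    tp = np.diag(conf)                  # p_ii
    row = conf.sum(axis=1)              # sum_j p_ij (ground-truth pixels per class)
    col = conf.sum(axis=0)              # sum_j p_ji (predicted pixels per class)
    union = row + col - tp
    pa = tp.sum() / conf.sum()                          # Equation (13)
    mpa = np.mean(tp / row)                             # Equation (14)
    miou = np.mean(tp / union)                          # Equation (15)
    fwiou = ((row / conf.sum()) * (tp / union)).sum()   # Equation (16)
    return pa, mpa, miou, fwiou

conf = np.array([[48000, 500],          # background (illustrative counts only)
                 [300, 1376]])          # crack
print(metrics(conf))
```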
Experimental results
Analysis of results
Initial learning rate. A number of studies have shown that the initial learning rate significantly affects the convergence of the loss function.50 It is known that a small learning rate will result in slow convergence, while a large one may hinder convergence. To obtain an appropriate value, three initial learning rates, 10⁻², 5 × 10⁻², and 8 × 10⁻², are compared after 100 epochs of training. The loss function curves are shown in Figure 7; it can be seen that all training losses have converged after 100 training epochs. Since the validation set is utilized to preliminarily evaluate the performance of crack segmentation, it is only necessary to compare the loss on the validation set. It is observed from Figure 7 that when the initial learning rate is 10⁻² or 8 × 10⁻², the validation loss is about 0.035 after 100 epochs, while the validation loss for 5 × 10⁻² is about 0.03. Through this comparison, the validation loss is found to be lowest when the initial learning rate is 5 × 10⁻².

However, the loss function curves cannot fully reflect the performance of AFFNet.51 The metric MIoU is also used to select an appropriate initial learning rate. The MIoU curves are shown in Figure 8. It is observed that the MIoU of the validation set is higher when the initial learning rate is 5 × 10⁻². Therefore, the initial learning rate of 5 × 10⁻² is selected as the optimal value in the proposed model.
Execution time. The execution time, which represents the training time for each image, is a valuable metric to evaluate model efficiency.49 To reflect the advantage in execution time, AFFNet is compared with four state-of-the-art models: U-Net, DeepLabv3+, Dilated FCN, and PAN. Due to its multi-scale feature fusion and wide application, U-Net52 is used as a comparison; to keep the output image the same size as the input image, zero-padding is adopted in its convolutional layers. Due to the advantage of combining ASPP with an encoder-decoder structure, DeepLabv3+53 is also chosen. Dilated FCN43 is used as the baseline. To reflect the advantage of the two attention mechanisms, PAN,54 which has the same structure as AFFNet, is also used for comparison. To ensure a fair comparison, all these models are trained with the same hyper-parameters and epochs. The runtime is measured on a computer with a high-performance GPU (NVIDIA GeForce RTX 1060, 6 GB) based on the PyTorch-1.7.1 framework. The results are summarized in Table 2; in descending order of execution time: U-Net > PAN > AFFNet > DeepLabv3+ > Dilated FCN. U-Net shows the longest execution time (67 ms) due to its use of deconvolutional layers. Although Dilated FCN has the shortest execution time (33 ms) due to its simple decoder structure, its segmentation performance is compromised. In summary, AFFNet has an acceptable execution time (52 ms) and the highest MIoU (see Table 5).
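For reference, per-image runtime is typically measured along the following lines; the warm-up and synchronization details are our assumptions, not the paper's exact protocol, and a CUDA device is required.

```python
# A sketch of per-image GPU runtime measurement with warm-up and sync.
import time
import torch

model = torch.nn.Conv2d(3, 2, 1).cuda().eval()   # stand-in for a segmentation model
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                          # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
print(f"{(time.time() - t0) / 100 * 1000:.1f} ms per image")
```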
Visualization of attention module. For VH-CAM, the attention map in Figure 2 is a crucial component; the weight distribution can be observed intuitively after visualizing the attention map. In Figure 9, for two input images, the corresponding attention maps are shown in column three. Red areas indicate a high contribution to the feature map while blue areas indicate a low contribution. It is observed that some blue areas lie in the background, avoiding the crack. This proves that VH-CAM can indeed guide the proposed model to focus on the crack, even if not all the red areas are attached to the crack.

For ECAUM, the ECA is performed on each channel using 1 × 1 convolution. Due to the small size of the feature maps and the large number of channels, it is not feasible to directly visualize the attention map. Here, Grad-CAM55 is used as the visualization tool to visualize the feature maps before and after the ECA
Table 2. Comparison of execution time.

Models      | Backbone  | Time (ms)
U-Net       | –         | 67
DeepLabv3+  | ResNet101 | 42
Dilated FCN | ResNet101 | 33
PAN         | ResNet101 | 56
AFFNet      | ResNet101 | 52

AFFNet: attention-based feature fusion network; FCN: fully convolutional network.
Figure 9. Visualization of feature maps produced by two attention mechanisms: (a) Input, (b) ground truth, (c) the attention map in
VH-CAM, (d) visualization results before using ECA, and (e) visualization results after using ECA.
ECA: efficient channel attention; VH-CAM: vertical and horizontal compression attention module.
Note. The Grad-CAM visualization tool is used in the last ECAUM.
Figure 10. Visualization results of different attention mechanisms: (a) input, (b) ground truth, (c) without any attention mechanism,
(d) with VH-CAM, (e) with ECAUM and (f) with VH-CAM and ECAUM.
VH-CAM: vertical and horizontal compression attention module; ECAUM: efficient channel attention upsample module.
in the last attention module (i.e., ECAUM(3) in Figure 1). In Figure 9, the corresponding feature maps before and after using channel attention are visualized in columns four and five, respectively, to verify whether it highlights crack areas. Before using channel attention, only a few blue areas are in the background, which means that the model considers cracks and background equally important. However, after using channel attention, most of the red areas in the background become blue. It is obvious that the ECAUM helps to locate crack pixels. In short, these visualizations demonstrate the importance of the two attention mechanisms for improving segmentation performance in crack detection.
Effects of attention mechanisms. From the previous section, it can be seen that the attention mechanism can remarkably improve segmentation performance by focusing on the important features, that is, cracks. In order to further understand the advantages of the two attention mechanisms, the effects of VH-CAM and ECAUM are visualized in Figure 10, where red boxes denote incorrect segmentation predictions. Four networks are compared: the ResNet101 backbone alone, ResNet101 with VH-CAM only, ResNet101 with ECAUM only, and ResNet101 with both attention mechanisms. As shown in Figure 10(c), part of the thin cracks is missed when no attention mechanism is used, especially the cracks at the image boundary. Meanwhile, Figure 10(d) demonstrates that some misclassified crack pixels at the image boundary are correctly classified after using VH-CAM. However, the crack is still not continuous and is divided into multiple disconnected segments. Because ECAUM can locate crack pixels better than VH-CAM, more crack pixels are classified correctly and the cracks become more continuous. However, there are still some undetected crack pixels, such as the thin crack at the bottom of the second image in Figure 10(e). By comparison, the segmentation predictions using both VH-CAM and ECAUM are better than using either one alone, and the cracks become more complete as a result.
Visualization of feature maps. Visualizing the feature maps of DL models can provide deep insight into how the proposed model works. Figure 11 takes three concrete crack images as examples to show the visualization results of the feature maps. It is observed that the feature maps closer to the input layer, such as Res-1 and Res-2, capture substantial crack features. However, noise such as handwriting is also captured by AFFNet, as shown in Figure 11(a). As the image progresses through the subsequent layers, the features become increasingly abstract, which is important for the model to detect cracks. As the size of the feature maps increases in the decoder, the crack features become more accurate and the noise is filtered out. When the image reaches the output layer, the pixels are classified as crack or background.
Comparative study
Ablation study for k in ECAUM. The ECAUM involves a 1D convolutional layer with a crucial parameter, the kernel size k, which determines the ability to capture local cross-channel interaction.46 Therefore, AFFNet is trained using different values of k, and the comparison results are summarized in Table 3, where k is fixed in all 1D convolutional layers. It can be seen that MIoU shows an increasing trend as the value of k becomes smaller.
Figure 11. Visualization of feature maps at different modules: (a) image 1, (b) image 2 and (c) image 3.
Note. Only one feature map is selected as example.
Table 3. Comparison results of ECAUM with different k (%).

Method | k | PA    | MPA   | MIoU  | FWIoU
ECAUM  | 3 | 98.36 | 92.01 | 84.49 | 97.07
ECAUM  | 5 | 98.26 | 91.33 | 83.47 | 96.87
ECAUM  | 7 | 97.85 | 90.76 | 82.77 | 96.76

ECAUM: efficient channel attention upsample module; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.
Table 4. Ablation study of two proposed attention mechanism modules on the test set (%).

Models      | Backbone  | VH-CAM | ECAUM | PA    | MPA   | MIoU  | FWIoU
Dilated FCN | ResNet101 |        |       | 97.75 | 82.96 | 76.26 | 95.88
AFFNet      | ResNet101 | ✓      |       | 97.89 | 84.8  | 78.21 | 95.96
AFFNet      | ResNet101 |        | ✓     | 98.35 | 91.86 | 84.22 | 96.88
AFFNet      | ResNet101 | ✓      | ✓     | 98.36 | 92.01 | 84.49 | 97.07

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; VH-CAM: vertical and horizontal compression attention module; ECAUM: efficient channel attention upsample module; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.
Since AFFNet has many hidden layers, using a smaller k can improve the nonlinear fitting ability of AFFNet. Consequently, the proposed AFFNet achieves the best result at k = 3.
Ablation study for attention modules. The ablation study is designed to validate the effectiveness of the two attention mechanisms. The models with different attention mechanisms and the corresponding evaluation metrics are summarized in Table 4. Because crack pixels normally occupy only a small proportion of the total pixels, MPA and MIoU are sensitive to small changes in the number of crack pixels according to Equations (14) and (15). Therefore, MPA and MIoU are used as the main indicators in this research.
Table 5. Segmentation results of five models (%).

Models      | PA    | MPA   | MIoU  | FWIoU
U-Net       | 98.11 | 88.83 | 81.57 | 96.6
Dilated FCN | 97.66 | 82.78 | 76.13 | 95.83
DeepLabv3+  | 97.8  | 86.57 | 79.62 | 96.04
PAN         | 97.6  | 84.07 | 77.26 | 95.68
AFFNet      | 98.36 | 92.01 | 84.49 | 97.07

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.
It can be seen that the baseline FCN without any attention mechanism obtains the lowest evaluation metrics, with an MPA of 82.96% and an MIoU of 76.26%. After applying the attention mechanisms, MPA and MIoU increase steadily as the number of correctly detected crack pixels increases. Compared to the baseline FCN, adopting only VH-CAM yields a slight improvement of 1.84% and 1.95% in MPA and MIoU, to 84.8% and 78.21%, respectively. Meanwhile, adopting only ECAUM achieves a substantial increase in MPA and MIoU of 8.9% and 7.96%, to 91.86% and 84.22%. The combination of VH-CAM and ECAUM yields 92.01% and 84.49% in MPA and MIoU, which proves that the two attention mechanisms work complementarily.
Comparison with other semantic segmentation models. To reflect the excellent performance of AFFNet, four state-of-the-art models trained on the same dataset are compared with the proposed model. The segmentation results of the five models are listed in Table 5. It is clear that the proposed AFFNet outperforms the other models. Owing to the concentration on crack features by VH-CAM and ECAUM, AFFNet captures more crack pixels than the other models. Therefore, AFFNet achieves the highest evaluation metrics, with MPA and MIoU reaching 92.01% and 84.49%, respectively. The Dilated FCN with its simple decoder shows the lowest MPA and MIoU compared to the other models, which have more trainable parameters in the decoder. PAN achieves a slightly higher MPA and MIoU than Dilated FCN; this is attributed to its simple combination of feature fusion and the attention mechanism used, namely global attention upsample, which is incapable of capturing fine crack features (76.79% in MIoU with global attention upsample only). The combination of ASPP and an encoder-decoder structure contributes to the good performance of DeepLabv3+ in crack detection. However, because its two upsampling operations cannot restore the lost details efficiently, the performance of DeepLabv3+ is inferior to U-Net and AFFNet. The four deconvolutional layers in U-Net are able to recover the image resolution; however, its MIoU is still lower than that of AFFNet on account of the direct fusion between low-level and high-level features.42 Consequently, the proposed AFFNet has distinct advantages and achieves the best performance: it can capture rich contextual information and guide low-level features to recover the crack localization.
On the other hand, in order to understand the respective enhancement brought by the two designed attention mechanisms, the VH-CAM and ECAUM are incorporated into U-Net and DeepLabv3+ for comparison. The same training set and test set are used to conduct this experiment with the two modified models. The results show that the MIoU of DeepLabv3+ increases by 1.17%, from 79.62% to 80.79%. Because it has only one feature fusion operation, the performance of DeepLabv3+ is not much improved. However, the four feature fusion operations of U-Net result in a great improvement in its performance with the addition of the two attention mechanisms, and its MIoU increases from 81.57% to 83.11%. These results indicate that the two attention mechanisms can indeed improve the performance of other models, but the extent of improvement is related to the number of feature fusion operations. Thus, VH-CAM and ECAUM can be plugged into existing semantic segmentation models.
Discussion
Visualization results of multi-type crack images

To verify the effectiveness and robustness of AFFNet, a comparative experiment is conducted using different types of cracks. Figure 12 shows the visual comparison between AFFNet and the other models. From top to bottom, the four types of cracks are diagonal cracks, transverse cracks, reticulation cracks, and wide cracks. From left to right, the columns are the input image, the ground truth, and the predictions of U-Net, Dilated FCN, DeepLabv3+, PAN, and AFFNet. It can be seen that when thin cracks or low contrast between cracks and background appear, Dilated FCN, DeepLabv3+, and PAN are not able to capture part of the thin cracks, as clearly presented in the reticulation crack case.
Figure 12. Prediction results of different types of cracks using different models: (a) input, (b) ground truth, (c) U-Net, (d) Dilated
FCN, (e) DeepLabv3 + , (f) PAN and (g) AFFNet.
FCN: fully convolutional network; AFFNet: attention-based feature fusion network.
The performance of U-Net is better than the above three models, but a few crack features in the reticulation crack image are still missing. Meanwhile, for wide cracks, the crack edge predicted by U-Net contains scattered pixels that belong to the background but are misclassified as crack pixels. In the wide crack case, Dilated FCN produces small holes in the crack area, and PAN cannot generate a complete crack due to the influence of background noise. In contrast, AFFNet adopts two attention mechanisms to extract more crack information, which brings great benefits in improving the accuracy of crack detection. Overall, the segmentation performance of AFFNet is better than that of the other models.
Visualization results of concrete cracks under
complex conditions
It should be noted that the above concrete crack images are relatively clean and contain low-level noise. In reality, however, cracks are quite varied and can present with various imagery disturbances. These images can be interfered with by spots, shadows, water stains, and handwriting, which increase the difficulty of crack detection. Therefore, another comparative experiment is conducted using cracks under complex conditions. Figure 13 shows the visual comparison between AFFNet and the other models on six types of cracks, such as crack with spots, crack with shadow, crack with water stain, and crack with handwriting. It can be seen that all the models have plausible abilities to distinguish crack and noise when detecting cracks with spots and shadow. However, the model deficiencies described in the previous section still exist. For example, Dilated FCN, DeepLabv3+, and PAN are unable to detect the thin part of the crack, and U-Net misclassifies part of the background as crack at the edge of the wide crack. In addition, U-Net also incorrectly detects the background at the shadow edge as crack. For the crack with water stain, all the other models exhibit false positives due to the low contrast between crack and water stain. These models overlook the width information of the cracks, and the predicted crack width is usually larger than the ground truth. Meanwhile, part of the thin cracks is also ignored by Dilated FCN, DeepLabv3+, and PAN. For the crack with handwriting, the discrepancy in crack detection is more distinct. Due to the unified pretrained ResNet101, Dilated FCN, DeepLabv3+, and PAN perfectly distinguish cracks and handwriting. However, U-Net incorrectly recognizes part of the handwriting as cracks. Different from the other models, AFFNet, based on a pretrained ResNet101, still provides a satisfactory crack segmentation result when detecting concrete cracks under complex conditions.
Figure 13. Prediction results of cracks under complex conditions using different models: (a) input, (b) ground truth, (c) U-Net,
(d) Dilated FCN, (e) DeepLabv3 + , (f) PAN and (g) AFFNet.
FCN: fully convolutional network; AFFNet: attention-based feature fusion network.
Quantification of crack images
Crack identification by AFFNet in the test set are
employed for the quantification of three morphological
features at a pixel level: crack area, crack length, and
crack mean width. The crack area is obtained by calculating the number of crack pixels. The acquisition of
crack length is relatively complex. The crack needs to
be skeletonized into the thin lines with a single-pixel
width and then the crack length can be obtained by calculating the number of pixels in thin lines. In this paper,
the approach in the research56 is used to perform the
skeletonizing crack task. The crack mean width is the
ratio between the crack area and the crack length.
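A sketch of this quantification is given below; skeletonize from scikit-image is used here as a stand-in for the skeletonization method of ref. 56.

```python
# A sketch of the pixel-level quantification: area, length, and mean width.
import numpy as np
from skimage.morphology import skeletonize

def quantify(mask):
    """mask: boolean array, True = crack pixel."""
    area = int(mask.sum())              # crack area = number of crack pixels
    skeleton = skeletonize(mask)        # single-pixel-width thin lines
    length = int(skeleton.sum())        # crack length = number of skeleton pixels
    mean_width = area / max(length, 1)  # crack mean width = area / length
    return area, length, mean_width

mask = np.zeros((224, 224), dtype=bool)
mask[100:104, 20:200] = True            # synthetic crack, 4 pixels wide
print(quantify(mask))                   # mean width close to 4
```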
The quantification differences between the predicted results and the ground truth are illustrated in Figure 14. As shown in Figure 14(a), the accuracy of the crack area is not yet competent, with scattered points diverging above the diagnostic line, indicating that background pixels in crack images are misclassified as crack pixels. Meanwhile, some crack pixels are ignored by AFFNet when the crack area is more than 6000 pixels; it is obvious that AFFNet is susceptible to underestimating the crack area for large cracks. With respect to crack length, most plotted points are near the diagnostic line, which means that AFFNet performs well in identifying crack length. The crack mean width is influenced by the two other indicators. Statistically, the predicted area and mean width are greater than the ground truth in 69.6% and 72.3% of the cases, while 74.23% of the predicted lengths are lower. This means that the proposed model tends to enlarge the crack width and decrease the crack length. A possible reason for the enlargement of the crack width is that AFFNet is prone to generating coarse segmentations of thin cracks because of up-sampling. The reason for the underestimation of the crack length is that thin cracks, especially reticulation cracks, are missed by AFFNet.
Figure 14. Quantification of concrete crack images at a pixel level: (a) crack area, (b) crack length and (c) crack mean width.
Figure 15. The crack image acquisition process.
In order to further evaluate the effectiveness of the proposed algorithm and to obtain the geometric information of actual concrete cracks, a new crack dataset called AFF-D (AFFNet dataset) was collected. The crack image acquisition process is shown in Figure 15: the concrete crack images were obtained with an iPhone camera, with the distance from the camera to the concrete surface set at 30 cm using a laser rangefinder. Then, a crack width meter was used to measure the actual size of the concrete cracks. After performing the above operations, more than 1700 concrete crack images with a resolution of 224 × 224 pixels were obtained. Built on the proposed AFFNet algorithm, the crack morphological features such as crack area, crack length, and crack mean width can be calculated for the AFF-D dataset. As mentioned above, the crack is skeletonized into thin lines of single-pixel width, and the crack geometric information is obtained by counting the number of pixels in the thin lines. Then, the actual crack area, length, and mean width are obtained by multiplying the area and length represented by each pixel using the camera calibration parameter k.
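The pixel-to-physical conversion can be sketched as follows; the value of the calibration parameter k below is purely illustrative and is not taken from the paper.

```python
# A sketch of converting pixel counts to physical dimensions via k (mm/pixel).
k = 0.25                            # assumed mm/pixel at the fixed 30 cm distance
area_px, length_px = 5608, 365      # example pixel counts from a segmentation

length_mm = length_px * k           # actual crack length in mm
area_mm2 = area_px * k ** 2         # each pixel covers k * k mm^2
mean_width_mm = area_mm2 / length_mm
print(area_mm2, length_mm, mean_width_mm)
```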
In Table 6, the geometric information of the actual concrete cracks is calculated, and the damage level of each crack is evaluated based on the actual crack area; these are valuable indicators for inspectors to evaluate and monitor structural health quantitatively.
Table 6. Calculation of crack parameters and evaluation of damage levels.

Crack type | Crack area (mm2) | Max crack width (mm) | Damage level
Crack 1    | 1402             | 15.37                | Mild damage
Crack 2    | 2568             | 37.65                | Moderate damage
Crack 3    | 3696             | 42.76                | Serious damage

(The original and segmentation images for each crack accompany this table in the published article.)
Evaluation of AFFNet using other datasets

To further verify the effectiveness of AFFNet, a robustness analysis is performed. Two new datasets, the DeepCrack57 and SDNET2018 datasets, are used for crack detection with the five selected models. The images in these two datasets contain a variety of noise different from that in our built dataset. For example, obstructions in DeepCrack include surface roughness and marks, while obstructions in SDNET2018 include holes and low lighting. It should be noted that these two datasets are not used to train AFFNet prior to the test, with the aim of examining the model's robustness. The crack images in the datasets need to be resized to 224 × 224 pixels due to the requirement of the asymmetric convolutions in VH-CAM.
Table 7 lists the performance of the five models tested on the DeepCrack dataset containing 527 crack images. Compared with the other models, AFFNet achieves the highest MIoU of 82.28%, with a distinct margin of at least 4.34% over the other models. As shown in Figure 16, four characteristic crack images are selected to display the prediction results. From left to right, the types of cracks are reticulation crack, crack with white line, crack with joint, and crack with handwriting.
Table 7. Segmentation results of five models on the DeepCrack dataset (%).

Models      | PA    | MPA   | MIoU  | FWIoU
U-Net       | 98.28 | 83.28 | 77.82 | 96.82
Dilated FCN | 98.16 | 80.24 | 75.05 | 96.69
DeepLabv3+  | 98.52 | 83.61 | 77.94 | 97.29
PAN         | 98.38 | 81.08 | 75.8  | 96.69
AFFNet      | 98.73 | 90.78 | 82.28 | 97.78

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.
It can be seen that AFFNet can effectively detect cracks, including cracks with rough backgrounds.
Besides the DeepCrack dataset, the SDNET2018 dataset is also used to test the effectiveness of AFFNet. A total of 50 randomly selected crack images are resized to 224 × 224 pixels with RGB channels and then manually labeled as test images.
Figure 16. Prediction results in the DeepCrack dataset using AFFNet: (a) reticulation crack, (b) crack with white line, (c) crack
with joint, and (d) crack with handwriting.
AFFNet: attention-based feature fusion network.
Table 8 lists the performance of the five models tested on the SDNET2018 dataset, which shows that AFFNet achieves the highest MIoU of 89.21%. Figure 17 illustrates four typical concrete crack images in the SDNET2018 dataset, including transverse crack, crack with low lighting, and crack with holes. The prediction results show that AFFNet has strong robustness regardless of the conditions attached to the crack.
Conclusion
In order to cope with the complex conditions around concrete structures, this paper implements a novel DL-based framework, namely AFFNet, for automatic concrete crack detection at the pixel level.
Table 8. Segmentation results of five models on the SDNET2018 dataset (%).

Models      | PA    | MPA   | MIoU  | FWIoU
U-Net       | 99.21 | 91.16 | 84.17 | 98.46
Dilated FCN | 98.98 | 87.03 | 80.97 | 98.07
DeepLabv3+  | 99.27 | 88.78 | 83.65 | 98.63
PAN         | 99.12 | 87.76 | 81.58 | 98.38
AFFNet      | 99.48 | 95.21 | 89.21 | 99.04

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.
In particular, the proposed AFFNet consists of ResNet101 as the backbone and two attention mechanism modules, the VH-CAM and the ECAUM.
Figure 17. Prediction results in the SDNET2018 dataset using AFFNet: (a) transverse crack, (b) crack with low lightening,
(c) crack with big holes, and (d) crack with tiny holes.
AFFNet: attention-based feature fusion network.
Specifically, the VH-CAM uses two convolution layers of kernel sizes 1 × W and H × 1 to make each pixel obtain more information, and then generates the attention map through matrix multiplication to capture rich contextual information. The ECAUM provides rich contextual information to guide low-level features.
The effectiveness and robustness of AFFNet are verified on a concrete crack dataset through a series of experiments. The experimental results show that the two attention mechanisms contribute to better performance in crack segmentation. The proposed model achieves the highest MIoU of 84.49% in comparison with other existing models, including U-Net, Dilated FCN, DeepLabv3+, and PAN. In addition, a robustness analysis is conducted using the DeepCrack and SDNET2018 datasets. The prediction results show that the proposed model also maintains accurate segmentation performance in detecting cracks in untrained datasets.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with
respect to the research, authorship, and/or publication of this
article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Youth Fund Project of the Jiangsu Natural Science Foundation (No. BK20180708) and the Science and Education Integration Innovation Pilot Program from Qilu University of Technology (Shandong Academy of Sciences)–International Collaboration Project (2022GH006).
ORCID iDs
Jiaqi Hang
https://orcid.org/0000-0001-8233-1089
Yancheng Li
https://orcid.org/0000-0002-6720-8493
References
1. Jiang S and Zhang J. Real-time crack assessment using
deep neural networks with wall-climbing unmanned aerial system. Comput-Aided Civ Infrastruct Eng 2019; 35(6):
549–564.
2. Nishikawa T, Yoshida J, Sugiyama T, et al. Concrete
crack detection by multiple sequential image filtering.
Comput-Aided Civ Infrastruct Eng 2012; 27(1): 29–47.
3. Hoang N-D. Detection of surface crack in building structures using image processing technique with an improved
Otsu method for image thresholding. Adv Civ Eng 2018;
2018: 1–10.
4. Ni F, Zhang J and Chen Z. Zernike-moment measurement of thin-crack width in images enabled by dual-scale
deep learning. Comput-Aided Civ Infrastruct Eng 2018;
34(5): 367–384.
5. Yeum CM and Dyke SJ. Vision–based automated crack
detection for bridge inspection. Comput-Aided Civ Infrastruct Eng 2015; 30(10): 759–770.
6. Fujita Y and Hamamoto Y. A robust automatic crack
detection method from noisy concrete surfaces. Mach
Vision Appl 2010; 22(2): 245–254.
7. Oliveira H and Correia PL. Automatic road crack segmentation using entropy and image dynamic thresholding. In: 2009 17th European signal processing conference,
Glasgow, Scotland, 2009, pp. 622–626.
8. Dhule JJ, Dhurpate NB, Gonge SS, et al. Edge detection
technique used for identification of cracks on vertical
walls of the building. In: 2015 international conference on
computing and network communications (CoCoNet), Trivandrum, India, 2015, pp. 263–268.
9. Abdel-Qader I, Abudayyeh O and Kelly Michael E. Analysis of edge-detection techniques for crack identification
in bridges. J Comput Civ Eng 2003; 17(4): 255–263.
10. Merazi-Meksen T, Boudraa M and Boudraa B. Mathematical morphology for TOFD image analysis and
automatic crack detection. Ultrasonics 2014; 54(6):
1642–1648.
11. Giakoumis I, Nikolaidis N and Pitas I. Digital image
processing techniques for the detection and removal of
cracks in digitized paintings. IEEE Trans Image Process
2006; 15(1): 178–188.
12. Krizhevsky A, Sutskever I and Hinton GE. ImageNet
classification with deep convolutional neural networks.
Commun ACM 2012; 60: 84–90.
13. Wang W, Hu W, Wang W, et al. Automated crack severity level detection and classification for ballastless track
slab using deep convolutional neural network. Autom
Constr 2021; 124: 103484.
14. Rao AS, Nguyen T, Palaniswami M, et al. Vision-based
automated crack detection using convolutional neural
networks for condition assessment of infrastructure.
Struct Health Monit 2021; 20(4): 2124–2142.
15. Deng J, Lu Y and Lee VC-S. Imaging-based crack detection on concrete surfaces using You Only Look Once network. Struct Health Monit 2021; 20(2): 484–499.
16. Hsieh Y-A and Tsai YJ. Machine learning for crack
detection: review and model performance comparison. J
Comput Civ Eng 2020; 34(5): 4020038.1–4020038.12.
17. Alipour M, Harris DK and Miller GR. Robust pixel-level
crack detection using deep fully convolutional neural networks. J Comput Civ Eng 2019; 33(6): 04019040.
18. Zhang L, Shen J and Zhu B. A research on an improved
Unet-based concrete crack detection algorithm. Struct
Health Monit 2020; 20(4): 1864–1879.
19. Huyan J, Li W, Tighe S, et al. CrackU-net: a novel deep
convolutional neural network for pixelwise pavement
crack detection. Struct Control Health Monit 2020; 27(8):
e2551.
20. Yang X, Li H, Yu Y, et al. Automatic pixel-level crack detection and measurement using fully convolutional network.
Comput-Aided Civ Infrastruct Eng 2018; 33(12): 1090–1109.
21. Zhang A, Wang KCP, Li B, et al. Automated pixel-level
pavement crack detection on 3D asphalt surfaces using a
deep-learning network. Comput-Aided Civ Infrastruct
Eng 2017; 32(10): 805–819.
22. Huyan J, Ma T, Li W, et al. Pixelwise asphalt concrete
pavement crack detection via deep learning-based semantic segmentation method. Struct Control Health Monit.
Epub ahead of print 5 April. DOI: 10.1002/stc.2974.
23. Li S, Zhao X and Zhou G. Automatic pixel-level multiple damage detection of concrete structure using fully
convolutional network. Comput-Aided Civ Infrastruct
Eng 2019; 34(7): 616–634.
24. Zhang A, Wang KCP, Fei Y, et al. Deep learning-
based fully automated pavement crack detection on 3D
asphalt surfaces with an improved CrackNet. J Comput
Civ Eng 2018; 32(5): 04018041.
25. Chen L-C, Papandreou G, Schroff F, et al. Rethinking
Atrous convolution for semantic image segmentation.
arXiv e-prints 2017. arXiv:1706.05587.
26. Zhao H, Shi J, Qi X, et al. Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA,
2017, pp. 2881–2890.
27. Wang JJ, Liu YF, Nie X, et al. Deep convolutional
neural networks for semantic segmentation of cracks.
Struct Control Health Monit 2022; 29(1): e2850.
28. Ji A, Xue X, Wang Y, et al. Image-based road crack risk-informed assessment using a convolutional neural network and an unmanned aerial vehicle. Struct Control
Health Monit 2021; 28(7): e2749.
29. Bahdanau D, Cho K and Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv e-prints 2014. arXiv:1409.0473.
30. Hu J, Shen L, Albanie S, et al. Squeeze-and-excitation
networks. IEEE Trans Pattern Anal Mach Intell 2020;
42(8): 2011–2023.
31. Woo S, Park J, Lee J-Y and Kweon IS. CBAM: convolutional block attention module. In: Proceedings of
the European conference on computer vision, Munich,
Germany, 2018, pp. 3–19.
32. Fu J, Liu J, Tian H, et al. Dual attention network for
scene segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR),
Long Beach, CA, USA, 2019, pp. 3141–3149.
33. Chen J and He Y. A novel U-shaped encoder–decoder
network with attention mechanism for detection and evaluation of road cracks at pixel level. Comput-Aided Civ
Infrastruct Eng. Epub ahead of print 18 February 2022.
DOI: 10.1111/mice.12826.
34. Fang J, Qu B and Yuan Y. Distribution equalization
learning mechanism for road crack detection. Neurocomputing 2021; 424: 193–204.
35. Xu S, Hao M, Liu G, et al. Concrete crack segmentation based on convolution–deconvolution feature fusion with holistically nested networks. Struct Control Health Monit. Epub ahead of print 23 March 2022. DOI: 10.1002/stc.2965.
36. Pan Y, Zhang G and Zhang L. A spatial-channel hierarchical deep learning network for pixel-level automated crack detection. Autom Constr 2020; 119: 103357.
37. Li G, Liu Q, Ren W, et al. Automatic recognition and analysis system of asphalt pavement cracks using interleaved low-rank group convolution hybrid deep network and SegNet fusing dense condition random field. Measurement 2021; 170: 108693.
38. Bhowmick S, Nagarajaiah S and Veeraraghavan A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from UAV videos. Sensors 2020; 20(21): 6299.
39. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.
40. Zhang Z, Lan C, Zeng W, et al. Relation-aware global attention for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, 2020, pp. 3183–3192.
41. Zhou Z, Zhou Y, Wang D, et al. Self-attention feature fusion network for semantic segmentation. Neurocomputing 2021; 453: 50–59.
42. Zhang Z, Zhang X, Peng C, et al. ExFuse: enhancing feature fusion for semantic segmentation. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 2018, pp. 269–284.
43. Long J, Shelhamer E and Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, 2015, pp. 3431–3440.
44. Ronneberger O, Fischer P and Brox T. U-Net: convolutional networks for biomedical image segmentation. Med Image Comput Comput-Assist Interv 2015; 9351: 234–241.
45. Badrinarayanan V, Kendall A and Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 2017; 39(12): 2481–2495.
46. Wang Q, Wu B, Zhu P, et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, 2020, pp. 11531–11539.
47. Goodfellow I, Bengio Y and Courville A. Deep learning, vol. 301. Cambridge, MA: MIT Press, 2016.
48. He K, Zhang X, Ren S, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 2015, pp. 1026–1034.
49. Garcia-Garcia A, Orts-Escolano S, Oprea S, et al. A review on deep learning techniques applied to semantic segmentation. arXiv e-prints 2017. arXiv:1704.06857.
50. Sutskever I, Martens J, Dahl G, et al. On the importance of initialization and momentum in deep learning. In: Proceedings of the 30th international conference on machine learning (PMLR), Atlanta, GA, USA, 2013, pp. 1139–1147.
51. Li G, Ma B, He S, et al. Automatic tunnel crack detection
based on U-net and a convolutional neural network with
alternately updated clique. Sensors 2020; 20(3): 717.
52. Liu Z, Cao Y, Wang Y, et al. Computer vision-based
concrete crack detection using U-net fully convolutional
networks. Autom Constr 2019; 104: 129–139.
53. Chen L-C, Zhu Y, Papandreou G, et al. Encoder-decoder
with Atrous separable convolution for semantic image
segmentation. In: Proceedings of the European conference
on computer vision (ECCV), Munich, Germany, 2018,
pp. 801–818.
54. Li H, Xiong P, An J, et al. Pyramid attention network for semantic segmentation. arXiv e-prints 2018. arXiv:1805.10180.
55. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM:
visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 2017,
pp. 618–626.
56. Lam L, Lee SW and Suen CY. Thinning methodologies – a comprehensive survey. IEEE Trans Pattern Anal Mach Intell 1992; 14(9): 869–885.
57. Liu Y, Yao J, Lu X, et al. DeepCrack: a deep hierarchical feature learning architecture for crack segmentation.
Neurocomputing 2019; 338: 139–153.