Neural Networks 131 (2020) 64–77
ASSAF: Advanced and Slim StegAnalysis Detection Framework for JPEG
images based on deep convolutional denoising autoencoder and
Siamese networks
Assaf Cohen a,b, Aviad Cohen a, Nir Nissim a,c,∗

a Malware Lab, Cyber Security Research Center, Ben-Gurion University of the Negev, Israel
b Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
c Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel
Article info
Article history:
Received 4 October 2019
Received in revised form 18 May 2020
Accepted 16 July 2020
Available online 29 July 2020
Keywords:
Steganography
Steganalysis
Deep learning
Autoencoder
Siamese neural network
Convolution neural network
Abstract
Steganography is the art of embedding a confidential message within a host message. Modern
steganography is focused on widely used multimedia file formats, such as images, video files, and
Internet protocols. Recently, cyber attackers have begun to include steganography (for communication
purposes) in their arsenal of tools for evading detection. Steganalysis is the counter-steganography
domain which aims at detecting the existence of steganography within a host file. The presence of
steganography in files raises suspicion regarding the file itself, as well as its origin and receiver, and
might be an indication of a sophisticated attack. The JPEG file format is one of the most popular
image file formats and thus is an attractive and commonly used carrier for steganography embedding.
State-of-the-art JPEG steganalysis methods, which are mainly based on neural networks, are limited
in their ability to detect sophisticated steganography use cases. In this paper, we propose ASSAF, a
novel deep neural network architecture composed of a convolutional denoising autoencoder and a
Siamese neural network, specially designed to detect steganography in JPEG images. We focus on
detecting the J-UNIWARD method, which is one of the most sophisticated adaptive steganography
methods used today. We evaluated our novel architecture using the BOSSBase dataset, which contains
10,000 JPEG images, in eight different use cases which combine different JPEG quality factors and
embedding rates (bpnzAC). Our results show that ASSAF can detect steganography with high accuracy
rates, outperforming, in all eight use cases, the state-of-the-art steganalysis methods by 6% to 40%.
© 2020 Elsevier Ltd. All rights reserved.
1. Introduction
Steganography has become a buzzword lately, achieving notoriety due to its use in several malicious activities involving images. Steganography
is the art of embedding a secret message or payload within
another form of media or communication method without the
awareness of a third person or gatekeeper. Invisible ink is an
example of a traditional steganography method in which specific
lighting is required to detect the secret message, and the original
message remains ‘‘unchanged’’. Modern forms of steganography
focus on computer files, such as multimedia file formats (images,
music files, videos, etc.), in addition to other forms of targeted
files like Office files and communication protocols, as a means of
embedding messages.
∗ Corresponding author at: Malware Lab, Cyber Security Research Center,
Ben-Gurion University of the Negev, Israel.
E-mail address: nirni@bgu.ac.il (N. Nissim).
https://doi.org/10.1016/j.neunet.2020.07.022
0893-6080/© 2020 Elsevier Ltd. All rights reserved.
Recently, a few cyber-attacks have included the use of image
steganography,1,2,3 which has several advantages in such attack
scenarios. First, it is less detectable. Second, it may confuse the
gatekeeper and investigator, as image files arouse less suspicion.
Third, it allows the attacker to use legitimate social networks to
host steganography images, since images are commonly posted
and shared on social networks and thus are considered harmless.
Thus, there has been an increase in the use of steganography in
the wild (e.g., a recent report of malware that received configuration and commands from its operator via meme images posted
on Twitter (Wei, 2018)).
Steganalysis is aimed at countering steganography by detecting its existence within a message. While steganalysis does not
reveal the secret message itself, since it could be encrypted, it
does determine whether a message embedded using steganography is present within a file.
1 https://thehackernews.com/2018/12/malware-twitter-meme.html
2 https://thehackernews.com/2016/12/image-exploit-hacking.html
3 https://securelist.com/steganography-in-contemporary-cyberattacks/79276/
The Joint Photographic Experts Group, or JPEG, is the most
common image file format, primarily due to its lossy compression, which results in good image quality with reduced file size.
Because of its popularity, JPEG images have become favored carriers for steganography.
Several steganography techniques have been developed specifically for JPEG images. J-UNIWARD (Holub, Fridrich, & Denemark,
2014), proposed in 2014, is a sophisticated, adaptive steganography method for JPEG images. Content-adaptive steganography
methods take the image's properties into account when embedding the secret payload, embedding the content in different
places inside the image depending on the image's properties;
prior to the emergence of adaptive steganography, image steganography embedding was accomplished using a pseudo-random key
to determine where the data should be embedded in an image,
without considering the differences between images. J-UNIWARD
is very stealthy, as it attempts to determine the optimal places
within the image to inject the secret payload, so that it leaves
the fewest footprints, making it more difficult to detect. This form
of adaptive steganography poses challenges to forensics and
steganalysts.
In the last decade, machine learning (ML) and deep neural
networks (DNN) have evolved at a fast pace, and the use of these
models in various domains has increased. This trend can be seen
in both the steganography and steganalysis domains, where more
and more methods have begun to utilize ML and DNN models.
Recent steganalysis research on JPEG images (particularly in
detecting the J-UNIWARD method) used DNN models, some of
which achieved impressive detection rates (i.e., Xu, 2017; Zeng,
Tan, Li, & Huang, 2018). However, the methods proposed have
very complex DNN architectures; thus, the training time is long,
and the architectures lack flexibility. Unfortunately, the performance of these approaches in extreme conditions (e.g., with a
high JPEG quality factor and small embedding rate) deteriorates,
with the detection rate dropping to around 65%, leaving much
room for improvement.
In this paper, we propose ASSAF, the Advanced and Slim
StegAnalysis Detection Framework for JPEG images based on deep
convolutional denoising autoencoder and Siamese networks,
which is a novel deep neural network architecture composed
of a convolutional denoising autoencoder and a Siamese neural
network, for the detection of steganography in JPEG images
embedded using J-UNIWARD. ASSAF is much lighter than architectures proposed in previous research and thus requires much
less training time. In addition, to the best of our knowledge, we
are the first to combine a denoising autoencoder and Siamese
neural network in one architecture in order to solve a binary
classification task.
The major contributions of this paper are:
1. We combine a denoising autoencoder and Siamese neural
network and leverage this novel pairing to address a binary
classification task. This combination allows us to use both
the original input data as well as the processed data as
input to the Siamese neural network classifier.
2. We propose ASSAF, a novel deep neural network architecture, which demonstrates that the denoising autoencoder and Siamese neural network combination can efficiently detect J-UNIWARD’s adaptive steganography in
JPEG images.
3. ASSAF significantly improves the detection capability of J-UNIWARD steganalysis, outperforming existing state-of-the-art methods.
The rest of the paper is organized as follows. In Section 2,
we provide the necessary background for our work. In Section 3,
we survey relevant related work from our field of research. We
present ASSAF in Section 4 and evaluate it in Sections 5 and 6. In Section 7, we discuss and conclude this research.
2. Background
In the following section, we provide the background necessary
for the reader to understand this study. We elaborate on several
topics that will be covered in this paper, in order to improve
understanding of our novel architecture.
2.1. JPEG image file format
JPEG, the most common image file format, is a lossy format that
provides a high compression ratio while maintaining image quality. JPEG file compression is based on several steps, including
performing discrete cosine transformation (DCT), quantization,
and encoding (Taubman, 2002). The image quality factor (QF)
is one of the parameters of the JPEG file compression process.
The QF determines the quality of the compressed image and the
amount of compression to be applied. A large QF means that
there is less compression and thus less data loss during the compression process; however, a large QF also provides more space
for steganography within a JPEG file and a larger embedding
‘‘surface’’.
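As a small illustration of the QF parameter (our example, not taken from the paper; it assumes the Pillow library and a hypothetical cover.png input file), the same image can be saved at the two quality factors used in this study:

```python
# Saving one grayscale image at the two JPEG quality factors used in this
# paper (75 and 95). "cover.png" is a hypothetical input file.
from PIL import Image

img = Image.open("cover.png").convert("L")   # grayscale, as in BOSSBase
for qf in (75, 95):
    img.save(f"cover_qf{qf}.jpg", format="JPEG", quality=qf)
    # Higher quality -> weaker quantization -> a larger file and more
    # nonzero DCT coefficients, i.e., a bigger embedding "surface".
```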
2.2. Image steganography
Image steganography is a subdomain of steganography which
focuses on embedding a secret message or payload within an
image file. Image steganography research has been performed on
several file formats, including PNG, BMP, and JPEG.
Least significant bit (LSB) steganography is one of the most
popular image steganography approaches today. Embedding the
payload in the LSB means that the secret payload will be embedded within each pixel in the bit that influences the pixel the
least, thus limiting its effect on the original image’s visual quality.
Extensive research has been conducted on LSB steganography
(e.g., Akhtar, Johri, & Khan, 2013; Huang, Zhong, & Huang, 2014;
Mielikainen, 2006; Sharp, 2001), a form of steganography that is
easy to implement and allows a significant payload size but is also
relatively easy to detect.
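To make the LSB idea concrete, here is a minimal sketch (ours, not from any of the cited works; the function names are hypothetical) of LSB embedding and extraction on a grayscale pixel array:

```python
# Minimal LSB steganography sketch: each payload bit replaces the least
# significant bit of one pixel, changing its intensity by at most 1,
# which is visually imperceptible but statistically detectable.
import numpy as np

def lsb_embed(pixels: np.ndarray, payload_bits: list) -> np.ndarray:
    stego = pixels.copy().ravel()
    for i, bit in enumerate(payload_bits):
        stego[i] = (stego[i] & 0xFE) | bit   # clear the LSB, then set it
    return stego.reshape(pixels.shape)

def lsb_extract(stego: np.ndarray, n_bits: int) -> list:
    return [int(p) & 1 for p in stego.ravel()[:n_bits]]

cover = np.random.randint(0, 256, (8, 8), dtype=np.uint8)  # toy cover image
stego = lsb_embed(cover, [1, 0, 1, 1])
assert lsb_extract(stego, 4) == [1, 0, 1, 1]
```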
In contrast, the frequency steganography approach uses various transformation techniques (such as the DCT; Ahmed, Natarajan, &
Rao, 1974) to transform the image into the frequency space and
injects the secret payload into the frequency space. This form of
steganography is much more secure and more difficult to detect; however, it offers less steganography space (i.e., space to embed data). Research (Coatrieux, Pan, Cuppens-Boulahia, Cuppens, & Roux, 2013; Qin, Chang, Huang, & Liao, 2013; Solanki,
Sarkar, & Manjunath, 2007; Tsai, Hu, & Yeh, 2009) has
shown that this form of steganography is efficient and more
secure than the simple LSB approach.
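The following toy sketch (our illustration, assuming SciPy; it is not any published embedding scheme) shows the frequency-domain idea on a single 8x8 block: transform with the 2D DCT, perturb one mid-frequency coefficient, and invert:

```python
# Toy frequency-domain "embedding": 2D DCT of an 8x8 block, a small change
# to one mid-frequency coefficient, and the inverse DCT back to pixels.
import numpy as np
from scipy.fftpack import dct, idct

block = np.random.rand(8, 8)                                  # stand-in image block
coeffs = dct(dct(block.T, norm='ortho').T, norm='ortho')      # 2D DCT
coeffs[3, 4] += 0.5                                           # perturb one coefficient
stego_block = idct(idct(coeffs.T, norm='ortho').T, norm='ortho')  # inverse 2D DCT
print(np.max(np.abs(stego_block - block)))                    # small, diffuse change
```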
Recently, adaptive steganography, a more sophisticated steganography approach, has emerged. Older forms of steganography
embedded the secret message in the host images in the same
form and did not take the image’s properties (such as monotonic
areas that might enable the detection of embedded messages)
into consideration. In adaptive steganography methods this is not
the case, as the methods examine the properties of the image and
identify the optimal spots in the image for embedding (e.g., the
‘‘noisier’’ parts of the image are good candidates for embedding,
since changes in those areas of the image will be harder to detect). Adaptive steganography methods are more complicated to
implement and have been shown to be harder to detect (Cheddad,
Condell, Curran, & Mc Kevitt, 2010). One of the most well-known
adaptive steganography techniques is the UNIWARD method and
its JPEG variant, the J-UNIWARD method (Holub et al., 2014).
Other forms of adaptive steganography include WOW (Holub &
Fridrich, 2012) and HUGO (Pevný, Filler, & Bas, 2010).
2.3. Image steganalysis
As the use of steganography methods for images has increased,
the need to analyze and detect those methods has grown as well.
Image steganalysis focuses on revealing the existence of such
embedded data in an image, rather than on revealing the actual
content of the embedded payload. Over the years, several steganalysis techniques have been studied, and currently the trend
in steganalysis research is to use machine learning and deep
neural network algorithms.
2.4. Deep neural networks
An artificial neural network (ANN) is a computing system
inspired by the biological human brain (Zurada, 1992). Neural
network architecture is composed of several neurons, each of
which includes an activation function that ‘‘fires’’ that neuron,
mimicking the neuron function in the human brain. The activation function of the neurons (also called the nonlinear properties)
allows it to describe simple nonlinear functions. The combination
of several neurons provides the ability to describe more complex
and sophisticated functions. Neural networks are composed of input, output, and hidden layers; each layer is composed of several
neurons that are connected to neurons in previous layers, but
usually there are no connections between neurons in the same
layer.
A deep neural network (DNN), or deep learning (DL), is a
neural network with more than two hidden layers. Those networks are able to learn very complex functions or situations. Deep
learning architecture is widely used, and a variety of applications
in different fields are based on this architecture (e.g., image processing, autonomous driving). Such technological advancements
have resulted in a significant increase in the development of
steganalysis methods based on deep learning architectures, which
are usually composed of convolutional neural networks (CNNs). A
CNN is a form of deep learning that uses convolutions of filters
applied to the image in order to perform some sort of processing
of the input image. The filters of the CNN are learned during the
network's training process in order to find the best set of filters
to fit a specific task. CNNs are a type of deep learning commonly
used in the image and video processing domains and thus are also
popular in the steganalysis domain.
2.4.1. Denoising Autoencoder (DAE)
An autoencoder (AE) is a neural network architecture that
is composed of two parts, an encoder and a decoder. The encoder transforms the input data (e.g., an image) into a latent
representation which is smaller than the original input (thus
also performing compression), identifying the meaningful features of the input. The decoder transforms the latent representation into the original data. Therefore, the autoencoder performs a
compression–decompression process.
Denoising autoencoders (DAEs) use the same encoder–decoder
architecture. The DAE’s main purpose is to denoise the input. The
main difference between an AE and a DAE is seen in the training
phase; in the case of a DAE, in the training phase the input image
contains some noise, and the feedback image provided to the
DAE is the same image without the noise. This sort of training
process allows the DAE to learn a latent representation of an
image without the addition of noise, and thus will be able to
remediate the noise that was injected into the input image. Examples of the use of DAEs can be found in Vincent (2008) and Vincent, Larochelle, Lajoie, Bengio, and Manzagol (2010).
2.4.2. Siamese Neural Network (SNN)
A Siamese neural network is a form of neural network (NN)
that receives two inputs (instead of one input as usual) and
performs a classification. A Siamese neural network is composed
of two identical legs that embed the input images, which are
followed by one or more classifier layers. Both legs are identical
in that their weights and architecture are the same, and thus
each leg performs the same embedding process on a given input
image; each leg performs the embedding process on one input
image. After the embedding process, the classifier layers perform
the classification task based on the distance or similarity of both
input images’ embeddings. This form of architecture is usually
used for one-shot classification of images (van der Spoel et al., 2015) (i.e., training an NN classifier with a small number of samples per class) or face recognition (Chopra, Hadsell, & LeCun, 2005).
An example of Siamese neural network architecture is provided
in Fig. 4.
3. Related work
Universal wavelet relative distortion, or UNIWARD (Holub
et al., 2014), is an adaptive steganography method that has
become quite popular since it was proposed by Holub et al. in
2014. UNIWARD was designed to evade detection by embedding
the secret payload while maintaining the distortion distribution
of the original image. UNIWARD uses direction filters (horizontal,
vertical, and diagonal) to determine the amount of distortion in
the image in order to identify the areas in the image which have
more noise; these areas will then be candidates for embedding
the secret information. The UNIWARD method aims to minimize
the distortion changes that may occur when the secret payload
is embedded into the image. UNIWARD determines the amount
of data that can be injected into the image by the embedding
rate parameter, measured by bits per nonzero AC DCT coefficient (bpnzAC) (Taubman, 2002). As the bpnzAC increases, the
capacity of embedded data increases, and thus it is easier to
detect the presence of steganography. J-UNIWARD is the variant
of the UNIWARD method for JPEG images. In comparison to other
adaptive JPEG steganography methods, J-UNIWARD showed good
evasion performance with a reasonable sized payload. In addition,
J-UNIWARD is quite popular and is considered the stealthiest JPEG adaptive
steganography method (Zeng et al., 2018). Therefore, we will
primarily focus on J-UNIWARD steganalysis in this paper.
Fig. 1 demonstrates steganography applied to an image using J-UNIWARD: (a) an original image, (b) the steganography image
embedded with a payload using J-UNIWARD, (c) the difference
between the original image and the image embedded with a
payload using J-UNIWARD; the brighter the color, the greater the
difference (black indicates no difference, and white indicates the
areas with the greatest difference). The difference between the
images shown in Fig. 1(c) reflects the changes resulting from
embedding the secret payload. As seen, the secret message is not
evenly distributed across the entire image. In addition, noisier
areas (like edges, which in this case are the rocks in the image)
in the image contain more changes than homogenous areas.
Since the research introducing J-UNIWARD was published,
several studies were performed in order to develop efficient steganalysis techniques aimed at this state-of-the-art steganography
method. There are two major approaches: designing handcrafted
features for steganography detection and the use of deep learning
methods.
Recently, DL methods have gained popularity in a large variety
of domains, including the steganalysis domain. The main advantage of DL is its flexibility. In contrast to other architectures that
needed customized handcrafted features as input, deep learning
architectures receive the raw input data and identify the relevant
Fig. 1. The difference between images with and without a J-UNIWARD steganography payload, with QF of 75 and a bpnzAC of 0.4: (a) the original image, (b) the
stego-image embedded with a payload using J-UNIWARD, (c) the difference between images a and b; black indicates no difference, and white indicates the areas
with the greatest difference.
Source: The image is from the BOSSBase (Bas, Filler, & Pevný, 2011) dataset.
features for the desired task during the training phase. Therefore,
DL architectures have the potential to improve the steganography
detection rate by learning the characteristics of the images with
embedded payloads and non-steganographic images.
One of the first approaches that included AE was proposed
by Tan and Li (2014) in 2014. In their method, several layers
of convolutional autoencoders were stacked in order to detect
steganography within images (not just in JPEGs in particular). Unfortunately, this novel method was unable to outperform existing
steganalysis techniques.
State-of-the-art methods that use deep learning for the detection of J-UNIWARD steganography in JPEG images are described
below.
In 2017, Chen, Sedighi, Boroumand, and Fridrich (2017) proposed a DL JPEG steganalysis method which uses a convolutional neural network. Two network configurations were proposed: PNet and VNet. The high-level architecture of the method
includes five groups of convolutional layers with batch normalization (BN) and TanH/ReLU activation functions; some of the
groups also include average pooling. After the five groups there is
a fully connected layer, which is followed by a softmax layer that
performs the classification. The main difference between VNet
and PNet is that PNet is composed of several parallel convolution
layers (which is not a very common DL architecture), while VNet
has no parallel layers but has more filters per layer than PNet.
For example, in one of the layers of PNet there are 64 different
parallel convolutional layers where each layer includes 64 filters
(which have a total of 4,096 convolutional filters); the equivalent
layer in VNet has 1,024 filters. The large number of layers in
both PNet and VNet results in a very complex network with
many parameters to train. This architecture improved upon the
performance of existing methods that used handcrafted features,
however it is much more complex and thus takes a lot of time to
train and maintain.
Another DL architecture proposed by Zeng et al. (Zeng-Net)
(Zeng et al., 2018) in 2018 includes several convolution layers in
several subnets, in addition to a large fully connected layer. Since
the proposed architecture is so massive and complex, the authors
had to train it on a relatively large dataset of images containing
at least 50K images, in contrast to other research that used a
dataset of around 10K images. Their results also outperformed
the detection rate of handcrafted feature-based methods, but left room
for improvement at low bpnzAC ratios.
Additional research in JPEG steganalysis was performed by Xu
(2017). He proposed a DNN architecture (also referred to
as J-XuNet) composed of 20 convolutional layers. In this method,
the first layer is a DCT layer that applies 16 DCT filters in
order to preprocess the input image, as previous
research (Holub & Fridrich, 2015) found that doing so improves
the detection rate of the network. This study demonstrated the
best performance at high bpnzAC ratios; however, the proposed
method suffers at low bpnzAC ratios, providing accuracy close
to that of random choice. In addition, like other prior work
mentioned above, it is very complex and takes a long time to train
and converge (e.g., in order to converge on the BOSSBase dataset,
this model required nearly 300 epochs).
SRNet (Boroumand, Chen, & Fridrich, 2019), a new state-of-the-art method, was proposed by Boroumand et al. in 2019.
SRNet is composed of 12 convolution layers in four different
architectures and uses a combination of batch normalization,
residual links, and average pooling. In contrast to prior work
such as Holub and Fridrich (2015) and Xu (2017), SRNet did
not use known filters or perform DCT transformation preprocessing on the input images. Experimental results showed better
performance than other existing state-of-the-art methods. Nevertheless, this method is still very complex and includes a large
number of layers and parameters and thus takes a long time to
train (e.g., 2.5 days on a 20K image dataset). In addition, the
performance of this method in challenging steganography use
cases is relatively low, with a detection rate approaching 70%.
We believe that it is possible to achieve improved steganography detection by using the novel deep neural network architecture proposed in this paper, leveraging both a DAE and Siamese
neural network.
4. ASSAF’s deep neural network architecture
Our proposed architecture for the efficient detection of
steganography within JPEG images embedded using J-UNIWARD
is composed of two phases. The first phase uses a denoising
autoencoder which performs preprocessing on the input image.
Later, in the second phase, we use a Siamese neural network
classifier in order to determine whether the original image contains steganography. The Siamese neural network does this by
measuring the distance between the original input image and
the DAE processed image. The rationale behind the proposed
architecture is that the preprocessing performed on the image
can be learned by the DAE during training. This is in contrast
to the use of fixed known filters as was done in previous work,
such as Holub and Fridrich (2015) and Xu (2017). Furthermore, our
proposed architecture does not identify the steganography artifacts directly like previous work such as Boroumand et al. (2019)
and Xu (2017) but instead our approach learns the meaningful
distance between the representations of those two images. We
Fig. 2. High-level architecture of the proposed method.
are the first to use a Siamese neural network in the steganalysis
domain, making our approach revolutionary and novel. In this
section, we provide a detailed description of the architecture’s
two phases. Fig. 2 illustrates the framework; as can be seen, phase
1 is composed of the DAE, and phase 2 includes the Siamese
neural network and the final classification, determining whether
the input image contains a steganographic payload or not.
4.1. Denoising Autoencoder (DAE)
The major role of the DAE in the ASSAF method’s architecture
is to perform preprocessing on the input image which will emphasize the existence of steganography within the input image.
The DAE attempts to ‘‘correct’’ some of the noise that was injected
into the image during the embedding process. In so doing, it
makes a distinction between images in which steganography
exists and images in which it does not
(referred to here as non-steganographic images), allowing the
Siamese neural network to identify those subtle changes and
make the correct classification. Previously proposed methods
like Xu’s and others (Holub & Fridrich, 2015; Xu, 2017) used
predefined DCT filters in order to preprocess the image. In our
framework, we trained a DAE to learn the relevant filters for
the preprocessing. This form of preprocessing is crucial for the
Siamese neural network to correctly identify the presence or lack
of steganography within the input image.
The DAE is composed of two parts, an encoder and a decoder.
The input for the DAE is a 512X512X1 pixel image. The encoder
has three convolution layers which embed the input image into a
latent representation that is smaller than the original image by
a factor of 3.2. The decoder receives the latent representation
of the image as input and decodes it to the original size. The
decoder uses three transposed convolution layers with the same
properties as the respective convolution layers of the encoder.
Fig. 3 presents the DAE architecture. In total, the DAE contains
six layers with 23,229 parameters.
The encoder contains the following convolution layers. The
first layer is composed of four filters with a seven by seven kernel
size (a.k.a. filter size). The second layer has 10 filters and a five
by five kernel size. The third and last layer of the encoder has 20
filters with a three by three kernel size. In addition, each of the
layers has a stride of two, and one-pixel padding.
After the latent representation has been created by the encoder, the decoder performs the decompression task with the
following transpose convolution layers. The first layer of the
decoder has similar properties to the last layer of the encoder: it
has 20 filters with a three by three kernel size. The second layer
has 10 filters and a five by five kernel size. The last layer of the
decoder has four filters with a seven by seven kernel (which is
similar to the first layer of the encoder). Just like the encoder,
each of the layers in the decoder also has a stride of two and
one-pixel padding. The output layer is a transposed convolution
layer with a single filter and a one by one kernel. The output
of the DAE is a 512X512X1 pixel image, matching the input
image dimensions.
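A minimal Keras sketch of this encoder-decoder, based only on the description above, might look as follows (our reconstruction, not the authors' code; the padding mode, activations, and loss are assumptions, since the text specifies only filter counts, kernel sizes, strides, and one-pixel padding):

```python
# Sketch of the described DAE (assumptions: 'same' padding, ReLU activations,
# pixel values scaled to [0, 1]; the exact parameter count may differ from
# the paper's 23,229).
from tensorflow.keras import layers, models

inp = layers.Input(shape=(512, 512, 1))
# Encoder: 4x(7x7), 10x(5x5), 20x(3x3) convolutions, each with stride 2.
x = layers.Conv2D(4, 7, strides=2, padding='same', activation='relu')(inp)
x = layers.Conv2D(10, 5, strides=2, padding='same', activation='relu')(x)
latent = layers.Conv2D(20, 3, strides=2, padding='same', activation='relu')(x)
# 64x64x20 latent: 262,144 / 81,920 = 3.2x smaller than the input, as stated.
# Decoder: mirrored transposed convolutions.
x = layers.Conv2DTranspose(20, 3, strides=2, padding='same', activation='relu')(latent)
x = layers.Conv2DTranspose(10, 5, strides=2, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(4, 7, strides=2, padding='same', activation='relu')(x)
out = layers.Conv2DTranspose(1, 1, padding='same', activation='sigmoid')(x)  # 512x512x1

dae = models.Model(inp, out)
dae.compile(optimizer='adam', loss='binary_crossentropy', metrics=['mse'])
# Training (Section 5.4.1): stego images as input, clean originals as targets.
```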
The process of training the DAE is done by providing images in
which steganography is present as input; the same image without
the presence of steganography is the expected output. This form
of training enables the DAE to identify the subtle changes to the
image resulting from the steganography process; essentially the
DAE tries to ‘‘denoise’’ those changes.
Once the DAE has been trained, each of ASSAF’s input images
is processed by the trained DAE. Later, in the architecture's second
phase, both the DAE-processed image and the original input
image serve as the input of the Siamese neural network.
4.2. The Siamese Neural Network (SNN)
The second phase of our architecture contains a Siamese neural network which is responsible for the classification of the JPEG
images. The input for the SNN is the original image (that might
be embedded with steganography or not) and the DAE processed
image. The SNN embeds the two input images in order to identify
a meaningful distance between the representations of those two
images. The representations are made by the SNN legs. The SNN
legs are identical: they have the same network architecture and
model weights. Thus, the embedding process performed on both
input images is the same (and is learned during the training
process).
Each of the SNN legs is composed of the following convolution
layers. The first convolution layer has 16 filters with a kernel size
of seven by seven with one-pixel padding and a stride of two.
After the convolution layer, there is a batch normalization layer.
The second convolution layer has 32 filters with a kernel size
of five by five with one-pixel padding and a stride of one. After
this layer there is a batch normalization layer followed by a 1D
max pooling with a stride of two. The third convolution layer
contains 64 filters with a kernel size of three by three with a
stride of one and one-pixel padding. This is followed by a batch
normalization layer and 1D max pooling with a stride of two. The
fourth and final convolution layer contains 16 filters with a one
by one kernel size with a stride of one and one-pixel padding. This
layer performs dimensionality reduction in order to decrease the
number of parameters needed and prevent model overfitting.
At the end of each leg there is a flattening layer. The classification component of the SNN is composed of a distance layer
which calculates the distance between the outputs of both legs’
flattening layers; the distance is the absolute difference between
the embedding representation vectors produced by both legs.
The distance layer is followed by a 100-neuron fully connected
layer with a ReLU activation function and a 0.2 dropout, and
finally a sigmoid neuron which performs the final classification.
The Siamese neural network’s general architecture is presented in
Fig. 4. A more detailed illustration of the architecture is presented
in Fig. 12 in the Appendix.
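Based on the layer description above, a Keras sketch of the SNN might look like the following (our reconstruction, not the authors' code; we assume ReLU activations after each convolution and interpret the ‘‘1D max pooling’’ as spatial max pooling with stride two, neither of which is stated explicitly):

```python
# Sketch of the described SNN; assumptions are noted in the lead-in.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_leg():
    # One leg; the same instance is applied to both inputs, so weights are shared.
    return models.Sequential([
        layers.Conv2D(16, 7, strides=2, padding='same', activation='relu',
                      input_shape=(512, 512, 1)),
        layers.BatchNormalization(),
        layers.Conv2D(32, 5, strides=1, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(64, 3, strides=1, padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(16, 1, strides=1, padding='same', activation='relu'),
        layers.Flatten(),
    ])

leg = build_leg()
original = layers.Input(shape=(512, 512, 1))
denoised = layers.Input(shape=(512, 512, 1))   # the DAE-processed image
# Distance layer: absolute difference between the two embeddings.
dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([leg(original), leg(denoised)])
x = layers.Dense(100, activation='relu')(dist)
x = layers.Dropout(0.2)(x)
out = layers.Dense(1, activation='sigmoid')(x)  # 1 = steganography present

snn = models.Model([original, denoised], out)
snn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

Note that, under these assumptions, the 100-neuron layer alone operates on a 64x64x16 = 65,536-dimensional embedding and thus contributes roughly 6.5M weights, which is consistent with the overall parameter count reported in Section 6.1.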
Fig. 3. The denoising autoencoder architecture.
Source: The image is from the BOSSBase (Bas et al., 2011) dataset.
Fig. 4. The architecture of the Siamese neural network.
Source: The image is from the BOSSBase (Bas et al., 2011) dataset.
Fig. 5. Data division into subsets used to train the DAE and SNN, and validate the SNN.
Some aspects of this architecture were inspired by insights
from Xu's work (Xu, 2017). In Xu's best performing network, only convolution layers were used (max pooling layers
were not used). We found that using only convolution layers
throughout our model did not improve its performance; however, we
found that applying this approach (i.e., using only a convolution layer, without pooling)
in the first layer did improve the model's performance.
Note that the DAE must be fully trained prior to training
the SNN, in order to be able to provide the DAE-processed image as the second input of the SNN. During the training phase,
we provide the SNN with the correct classification (whether
or not the input image contains steganography)
as feedback. That way, the SNN can determine the relevant
representation of the images for the classification task.
5. Evaluation
In this section, we evaluate ASSAF and compare it to existing
state-of-the-art methods. We begin by presenting the data collection used for evaluation and then discuss the evaluation metrics
used.
5.1. Data collection and preparation
In our experiments we used the BOSSBase dataset (Bas et al.,
2011), the most popular dataset for steganography. Since its
publication in 2011, it has appeared in almost every respected
steganography and steganalysis research paper. This dataset includes 10,000 non-compressed, grayscale, 512X512 images. We
processed the images as follows: (1) we converted the images to
the JPEG file format at two QFs: 95 and 75; (2) we applied the
J-UNIWARD steganography method with four bpnzAC levels for
each of the QFs: 0.1, 0.2, 0.3, and 0.4, thus creating eight different
use cases. Each of the use cases was then divided randomly into
the various configurations of training and evaluation sets used in
our experiments (described later in the paper).
5.2. Research questions
Our experiments are aimed at answering the following research questions:
1. Is the ASSAF deep learning architecture able to detect J-UNIWARD steganography in JPEG images?
2. Does ASSAF provide better detection results than the current state-of-the-art methods?
5.3. Evaluation metrics
In our experiments, we use the following evaluation metrics:
The Accuracy, presented in Eq. (1), assesses our model's ability
to correctly label (classify) the images. The true positive (TP) count is
the number of steganography images that were correctly labeled
as containing steganography, and the true negative (TN) count is the
number of non-steganography images that were correctly labeled as
not containing steganography.
The True positive rate (TPR), presented in Eq. (2), is the rate
of correctly labeling the positive class.
The False positive rate (FPR), presented in Eq. (3), is sometimes referred to as the rate of false alarms. It is calculated by dividing the number of false positives (the instances that the model
incorrectly classified as positive) by the total number of negative
samples (the images that do not contain steganography).
\[ \text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{Number of samples}} \tag{1} \]

\[ \text{TPR} = \frac{\text{True Positive}}{\text{Number of positive samples}} \tag{2} \]

\[ \text{FPR} = \frac{\text{False Positive}}{\text{Number of negative samples}} \tag{3} \]

Fig. 6. Loss during DAE training (with QF 75 and bpnzAC 0.4).
The AUC is the area under the receiver operating characteristic
curve (ROC curve). The ROC curve shows the tradeoff between the
TPR and FPR across different threshold values. The AUC score is
a commonly used metric for binary classifier performance in the
machine learning domain.
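For reference, these metrics could be computed from a validation run as follows (a sketch using scikit-learn, which is our assumption; the paper does not state how the metrics were computed):

```python
# Computing Accuracy (Eq. 1), TPR (Eq. 2), FPR (Eq. 3), and AUC from model
# scores; the toy arrays stand in for real validation outputs.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0])               # 1 = stego, 0 = clean
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6])  # sigmoid outputs
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)  # Eq. (1)
tpr = tp / (tp + fn)                       # Eq. (2)
fpr = fp / (fp + tn)                       # Eq. (3)
auc = roc_auc_score(y_true, y_score)       # area under the ROC curve
```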
5.4. Experimental design
In this section, we describe the experiments conducted in
order to evaluate ASSAF, our proposed DNN architecture for J-UNIWARD steganography detection. The ASSAF architecture differs from the ‘‘standard’’ deep neural network architecture as it
is composed of two smaller neural networks, and thus requires a
modified approach for training and evaluation. Each of the NNs in
our model must be trained in a different manner with a different
set of images in order to avoid feeding poisoned or already ‘‘seen’’
images between the two NNs. In order to tackle this situation, we
had to plan our evaluation carefully.
The following subsections provide a detailed description of
how ASSAF was trained and evaluated (note that this was done
using the BOSSBase dataset on the eight use cases presented in
Section 5.1; each use case is referred to as an experiment). For
each use case, we randomly split the dataset into subsets: (1)
a training set for the DAE, (2) a training set for the SNN, and
(3) a validation set for the final classification using the SNN.
The random split ensures that ASSAF performs well in various
scenarios and prevents biases and variance that might stem from
the data split. Fig. 5 illustrates the data division into subsets.
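A sketch of this three-way split (our reconstruction; the subset sizes follow Section 5.4.1, and the seed is hypothetical):

```python
# Random split of the 10K BOSSBase image indices into the three subsets.
import random

image_ids = list(range(10_000))
random.seed(42)                      # hypothetical seed, for reproducibility
random.shuffle(image_ids)
dae_train = image_ids[:5_000]        # trains the denoising autoencoder
snn_train = image_ids[5_000:9_000]   # trains the Siamese network
validation = image_ids[9_000:]       # never seen by either network
```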
We conducted the experiments using the Python programming language and Keras,4 an open source, high-level neural
network library.
5.4.1. Training ASSAF
Training the DAE
In order to train the DAE, we first randomly selected 5K images
from the BOSSBase dataset. We embedded them with steganography using J-UNIWARD, creating 5K pairs of images (meaning
10K images, 5K of which contain steganography and 5K without).
During the DAE training, we use the 5K steganography images
4 https://keras.io/
as input and provide the matching original images (without the
steganography) as feedback. After the DAE is trained, we use it to
provide the processed input for the SNN training step.
During the training process, we monitored the DAE loss and
mean squared error (MSE) values in each epoch. We observed that
when the loss value is less than 0.485 and the MSE value is 0.01 or
less, the performance of the overall architecture improves. In the
case of higher MSE and loss values (e.g., caused by shorter training
periods), the SNN cannot effectively differentiate between the
steganography and non-steganography images, and this negatively impacts the performance of the overall architecture. We
conducted a preliminary experiment in order to determine the
number of images to use to train the DAE and found that it is
better to use at least 5K images; the DAE training process is less
efficient when fewer images are used, compromising the SNN
classification capabilities. We also found that using additional
images when training the DAE did not significantly improve the
loss value. For example, when training on all 10K images of the
BOSSBase dataset, the loss value plateaued at around 0.47.
Fig. 6 presents the DAE loss during training. As can be seen,
the loss decreases over the epochs and reaches a plateau at a loss
of approximately 0.484. In order to achieve the desired loss and
MSE that were mentioned earlier, the DAE needs about 200–250
epochs of training; further training will not improve the results
dramatically and might cause overfitting.
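The loss/MSE thresholds above could be monitored with a standard Keras callback, for example (a sketch; the authors do not describe their monitoring code):

```python
# Stop DAE training once the thresholds reported above are reached
# (loss < 0.485 and MSE <= 0.01); assumes the DAE was compiled with
# metrics=['mse'].
from tensorflow.keras.callbacks import Callback

class ThresholdStop(Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get('loss', 1.0) < 0.485 and logs.get('mse', 1.0) <= 0.01:
            self.model.stop_training = True

# dae.fit(stego_images, clean_images, epochs=250, callbacks=[ThresholdStop()])
```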
Training the SNN
From the remaining 5K images of the 10K image BOSSBase
dataset (those which were not used to train the DAE), we randomly selected 4K images in order to train the SNN. We embedded them with steganography using J-UNIWARD, creating
4K pairs of images (meaning 8K images, 4K of which contain
steganography and 4K without). Later, we passed these pairs of
images through the trained DAE, creating an additional 4K pairs
of images. In total, we have 16K images divided into four groups,
each of which contains 4K images. The first group contains the 4K
original images; the second group contains the 4K original images
embedded with steganography; the third group contains the 4K
images of the first group after the processing of the trained DAE;
and the fourth group contains the 4K images of the second group
after the processing of the trained DAE.
We trained the SNN using 8K pairs of instances of the same
image (in total 16K images as mentioned earlier), the first instance of each pair is the image that was not processed by the
DAE (groups 1 and 2), and the second instance is the processed
image (groups 3 and 4); the pairs were fed as input into the two
legs of the SNN. Table 1 describes the four groups of images.
Fig. 7. The Siamese neural network's training process: (a) an input image (before the DAE) taken from a mixed steganography/non-steganography image pool; (b) the
second input image, which is image (a) after processing by the DAE; (c) the SNN leg; both legs are the same, and hence have the same weights and architecture; (d)
the SNN classification layers; (e) the output of the SNN is a classification.
Source: The images are from the BOSSBase (Bas et al., 2011) dataset.
Fig. 8. The architecture flow for the evaluation phase: (a) an input image from the validation set, a mixed steganography/non-steganography image pool; (b) the trained
DAE, which performs processing on the input image; (c) the second input image, which is image (a) after processing by the DAE; (d) the trained SNN legs; both legs
are the same, and hence have the same weights and architecture; (e) the trained SNN classifier; (f) the output of the SNN is a classification.
Source: The images are from the BOSSBase (Bas et al., 2011) dataset.
Table 1
Description of the groups of images used.

Group | Size | Contains steg. | Processed with DAE | Training the SNN
1 | 4K | No | No | SNN 1st leg
2 | 4K | Yes | No | SNN 1st leg
3 | 4K | No | Yes | SNN 2nd leg
4 | 4K | Yes | Yes | SNN 2nd leg
The feedback given to the SNN during training is zero if the original image does not contain steganography and one if it does. Fig. 7 illustrates the SNN training
process.
The Siamese neural network training process, as in most
DL methods, is performed with minibatches. We conducted a
preliminary experiment in order to find the best training configuration for the SNN. We found that the most effective learning
is achieved when the same image appears twice in a minibatch,
once with steganography and once without steganography. This
configuration helps the model learn only the relevant ‘‘signal’’
of the steganography and prevents it from being fooled by the
visualization of the image itself. Training the Siamese neural
network using a different configuration degrades the detection
performance by at least 10%.
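The pairing rule described above might be implemented as follows (a sketch under our assumptions; covers and stegos are hypothetical aligned arrays of the same images without and with a payload):

```python
# Build a minibatch in which every cover image appears twice: once clean
# (label 0) and once with steganography (label 1), so only the embedding
# "signal" differs within each pair.
import numpy as np

def make_minibatch(covers, stegos, dae, indices):
    x_orig, x_dae, labels = [], [], []
    for i in indices:
        for img, label in ((covers[i], 0), (stegos[i], 1)):
            x_orig.append(img)
            x_dae.append(dae.predict(img[None, ...])[0])  # DAE-processed copy
            labels.append(label)
    # Two inputs for the two SNN legs, plus the binary labels.
    return [np.array(x_orig), np.array(x_dae)], np.array(labels)
```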
5.5. Evaluation of ASSAF

The remaining 1K original images (out of the database's 10K
images) were then embedded with steganography using
J-UNIWARD, resulting in 1K pairs of images (2K images in total): 1K that
contain steganography and 1K that do not (again, paired
as before). Neither the SNN nor the
DAE has seen these 2K images during the training process, and
thus they are used as the validation set for ASSAF. Note that the
validation set contains pairs of the same image (one image without steganography, and the same image with steganography); this
is in contrast to validation sets typically used, in which all of the
images are visually different. This is a difficult test, as the model
must not be confused by the image's visual content when attempting
to correctly classify the images.
Basic Evaluation
We conducted the evaluation process as follows. We randomly
chose an image from the 2K image pool and processed it with
the trained DAE, receiving the processed version of the image
as output. Next, we fed the SNN two inputs, the original image
and the processed image. Finally, the classification output of the
SNN determines whether the input image contains a J-UNIWARD
steganography payload. Fig. 8 illustrates the evaluation process
Fig. 9. A comparison between the original images and the denoised stego-images processed by the DAE (with QF of 75 and a bpnzAC of 0.4), with and without
a J-UNIWARD steganography payload: (a) and (d) the original image; (b) the stego-image embedded with a payload using J-UNIWARD; (c) the difference between
images a and b (black indicates no difference, and white indicates the areas with the greatest difference); (e) the denoised stego-image processed by the trained
DAE; (f) the difference between images d and e (black indicates no difference, and white indicates the areas with the greatest difference).
Source: The image is from the BOSSBase (Bas et al., 2011) dataset.
Table 2
The validation results of the ASSAF architecture in our experiments.

QF | Metric | bpnzAC 0.1 | bpnzAC 0.2 | bpnzAC 0.3 | bpnzAC 0.4
75 | TPR | 0.921 | 0.985 | 0.994 | 0.993
75 | FPR | 0.097 | 0.021 | 0.009 | 0.006
75 | Accuracy | 0.912 | 0.982 | 0.9925 | 0.9935
75 | AUC | 0.9691 | 0.9981 | 0.9995 | 0.9997
95 | TPR | 0.765 | 0.809 | 0.899 | 0.975
95 | FPR | 0.158 | 0.108 | 0.15 | 0.051
95 | Accuracy | 0.8035 | 0.8505 | 0.8745 | 0.962
95 | AUC | 0.8766 | 0.9332 | 0.9513 | 0.9931
explained above. As mentioned before, this evaluation was done for each of the eight use cases presented in Section 5.1.
Evaluating the Robustness of ASSAF against Different QF and
bpnzAC parameters
In addition to the basic evaluation described above, we performed an additional experiment to evaluate the robustness of
the proposed architecture. In this experiment our goal was to
thoroughly evaluate our proposed architecture and assess its
robustness when using different QF and bpnzAC parameters in
the training and test phases.
In order to do so, we used eight different combinations of QF
and bpnzAC parameters, based on two QF levels (75 and 95); for
each QF we used four different bpnzAC levels (0.1, 0.2, 0.3, 0.4),
thus examining a total of eight different combinations.
In each one of the eight QF and bpnzAC combinations, we
trained the architecture and tested it against the other combinations of QF and bpnzAC. In order to ensure a fair comparison
during the cross-validation process, the training and test sets used were randomly selected and remained the same throughout the whole experiment.
6. Results
In this section, we provide the results of our comprehensive
evaluation performed on eight different use cases and compare
the evaluation results with the performance of current state-of-the-art JPEG steganalysis methods.
The results of the eight experiments are presented in Table 2.
Recall that each experiment is conducted on a unique combination of the bpnzAC and quality factor. The results show that, as
anticipated, as the bpnzAC increases and the QF decreases, the
accuracy, AUC, and TPR improve, while the FPR decreases. It
can also be observed in the table that the model's architecture
achieves what is likely its maximal detection accuracy of 0.993
with the combination of QF 75 and higher bpnzAC ratios. In the
extreme case of QF 95 and bpnzAC of 0.1, the model struggles
and obtains a higher FPR and a lower accuracy of 0.803, but still
outperforms existing state-of-the-art methods, as we discuss later
on.
Table 3 presents an example of an image with steganography
using J-UNIWARD and without steganography, providing a comparison between the original input image and the DAE output
image, and the difference between the two images for each case.
As can be seen, the DAE processing did not impact the image’s visual quality. Furthermore, a closer look at the images shows small
differences between the steganography and non-steganography
images; we hypothesize that these small differences will be used
by (or will enable) the SNN to differentiate between images that
contain steganography and those that do not.
Table 3
The difference between the input image and the output image from the DAE. The brighter the color, the greater the difference between the two images (white indicates the greatest difference, while black means no difference). The image is from the BOSSBase (Bas et al., 2011) dataset.
Fig. 9 provides a comparison between the original image and
the stego-image (embedded with a payload using J-UNIWARD
and without an embedded payload) and illustrates the difference
between these images. As can be seen from this comparison, the
DAE denoises the noisier areas where it suspects the J-UNIWARD
method will embed the payload, such as the lighthouse itself;
however, as seen in the images showing the difference (Fig. 9(c)
and (f)), J-UNIWARD did not embed much information within
the lighthouse. This shows that the DAE does not transform the
stego-image to its original form but tries to fix all of the possible
locations within the image that might be populated with the
J-UNIWARD payload. Therefore, the output of the DAE is a preprocessed form of the input image that contains subtle changes
to the image, especially in the noisier areas of the image which
are usually occupied by the J-UNIWARD steganography payload.
The DAE preprocesses the image, an operation that subsequently assists the SNN in the classification task. In the steganalysis domain it is quite common to preprocess the images in
order to improve the steganalysis model’s performance (Holub
& Fridrich, 2015; Xu, 2017). Several studies that employed CNNs
used a fixed set of filters which was empirically found to improve the model's performance (Chen et al., 2017). However, in
our study, instead of using the commonly used fixed filters, we
tried to learn the preprocessing process using an NN. We found
that the DAE performs this task efficiently, especially when later
combined with the SNN.
The process of training the SNN is relatively quick, and the
network usually converges in less than 100 epochs of training,
which takes around two to three hours on a proper GPU.
Fig. 10 presents the model’s accuracy during the training of
the SNN for the combination of a bpnzAC of 0.4 and each QF
evaluated (QF 95 and QF 75). In each epoch of the model’s training
there are 160 weights update iterations. As can be seen, the
accuracy of QF 75 is ‘‘smoother’’ than that of QF 95; this is to be
expected, because QF 95 is the more difficult use case for the
model. It is also quite clear that the model converges within a few
epochs. With QF 75 there is no difference between the training
accuracy and the validation accuracy, showing that there is no
model overfitting, while this is not the case with QF 95 (e.g., around
the thirtieth epoch, the QF 95 model overfits, but it
later recovers).
Table 4 presents the results of our evaluation of ASSAF’s robustness in terms of detection accuracy. The table presents the
results in the form of a heat map where red symbolizes lower
detection rates and green represents higher detection rates. The
diagonal of the table contains the results of the ASSAF instances
that were trained and tested on the same QF and bpnzAC setting.
In our study, we define robustness as the model’s ability to effectively detect steganography in images that were embedded with
bpnzAC and QF levels that are different from the levels on which
the model was trained. In the table it can be seen that models
with lower bpnzAC levels are more robust than models with
higher bpnzAC levels. In addition, there was a major difference
in the results when using different QF models; models that were
trained on QF = 95 tend to perform better than models which
were trained on QF = 75, as can be seen in the rightmost column
(AVG) in Table 4. It also seems that in some cases, the models
with lower bpnzAC levels tend to perform slightly better than the
models that were trained and tested on the same bpnzAC level.
For example, the model that was trained on QF 75 and bpnzAC 0.1
outperformed the model that was trained on QF 75 and bpnzAC
0.2 by 0.6% in terms of detection accuracy. In practice, training
the architecture with QF = 95 and a bpnzAC value of 0.1 was
more likely to detect a steganography payload with a different
combination of bpnzAC and QF. An additional insight obtained
by performing this evaluation is that the connection between
the SNN and the DAE that it was trained with is crucial, as
mixing between images that were denoised with a different DAE
than the SNN was trained on resulted in a major degradation in
performance (a decrease of more than 30% in detection accuracy).
6.1. Comparison to previous research

In this section, we compare ASSAF to the following state-of-the-art methods in the J-UNIWARD steganalysis domain: Zeng-Net (Zeng et al., 2018), J-XuNet (Xu, 2017), SRNet (Boroumand
et al., 2019), and VNet and PNet (Chen et al., 2017).
Tables 5 and 6 present a comparison of the validation detection accuracy of ASSAF and the abovementioned methods for
the eight use cases evaluated. As can be seen, ASSAF outperforms the current state-of-the-art methods in all of the use cases.
In the tables, we also present the percent of improvement (in
parentheses) obtained by ASSAF over the best performing existing
method in each use case. It can be seen that ASSAF provides
greater improvements for the higher QF evaluated (QF 95), which
is the more challenging scenario. It is clear that ASSAF provides
a breakthrough in J-UNIWARD steganography detection, providing a
6%–40% relative improvement over existing state-of-the-art methods
in all of the use cases tested.
Fig. 10. Training and validation accuracy by epoch during the SNN training process: (a) QF 75, bpnzAC 0.4, and (b) QF 95, bpnzAC 0.4.
Table 4
ASSAF's robustness in terms of detection accuracy when trained on one combination of QF and bpnzAC and tested on the others, presented as a heat map where red symbolizes lower detection rates and green represents higher detection rates.
Table 5
Comparison of the detection accuracy of existing methods and the relative improvement of ASSAF over the top performing method (at the four evaluated bpnzAC ratios and QF 75).

Method | bpnzAC 0.1 | bpnzAC 0.2 | bpnzAC 0.3 | bpnzAC 0.4
VNet | 0.638 | 0.776 | 0.866 | 0.929
PNet | 0.642 | 0.7874 | 0.877 | 0.934
Zeng-Net | 0.54 | 0.64 | 0.74 | 0.8
J-XuNet | 0.671 | 0.805 | 0.887 | 0.935
SRNet | 0.679 | 0.811 | 0.884 | 0.933
ASSAF | 0.912 (34.3%↑) | 0.982 (21%↑) | 0.992 (12.2%↑) | 0.993 (6.4%↑)
Table 6
Comparison of the detection accuracy of existing methods and the relative improvement of ASSAF over the top performing method (at the four evaluated bpnzAC ratios and QF 95).

Method | bpnzAC 0.1 | bpnzAC 0.2 | bpnzAC 0.3 | bpnzAC 0.4
VNet | 0.529 | 0.572 | 0.667 | 0.74
PNet | 0.541 | 0.601 | 0.681 | 0.746
J-XuNet | 0.544 | 0.614 | 0.693 | 0.763
SRNet | 0.572 | 0.656 | 0.748 | 0.823
ASSAF | 0.803 (40%↑) | 0.85 (29%↑) | 0.874 (16.8%↑) | 0.962 (16.8%↑)
The graphs presented in Fig. 11 provide another comparison of ASSAF and the existing state-of-the-art methods for the two QFs evaluated. It can be clearly seen that ASSAF is superior to the existing methods in all eight use cases.
Fig. 11. Comparison of the detection accuracy of existing methods and ASSAF with QF 75 (a) and QF 95 (b) and different bpnzAC ratios.
Note that in our domain, especially in the era of deep learning
methods, the BOSSBase dataset containing 10K images is sometimes insufficient for training a large neural network such as those used
by current state-of-the-art methods. This situation has resulted
in differences in the experimental configurations used in previous research. The Zeng-Net (Zeng et al., 2018) neural network
architecture was trained differently on the BOSSBase dataset. In
their experiment, the authors created a larger dataset of 40K
images, slicing each of the 512X512 original images into four
256X256 images. Comparing ASSAF’s performance to the results
obtained in their work is slightly unfair, since more training
data was used, and thus their performance should be better.
Despite this, we can see that our model outperforms their model’s
performance. The SRNet experiments combined BOSSBase with
the BOWS2 (Bas & Furon, 0000) dataset in order to create a 20,000
256X256 image collection; this means that they also resized the
original 512X512 BOSSBase images into 256X256 images, which
resulted in an easier testing scenario. ASSAF outperformed the
SRNet architecture by a wide margin although SRNet, like other
state-of-the-art methods introduced in previous research, was
trained on a larger dataset by manipulating the original BOSSBase or by using additional datasets. Other papers, such as the
paper on J-XuNet (Xu, 2017), did not explain explicitly how they
divided the BOSSBase dataset in their evaluation, making a fair
comparison difficult. We believe that our experimental design is
rigorous and that our results can be compared to the others as
we only used the BOSSBase dataset without manipulation of the
data (e.g., cropping or resizing the original images), and we did
not make any compromises when testing ASSAF. For example,
In this paper, we presented ASSAF, a novel steganalysis architecture for the detection of J-UNIWARD steganography within JPEG images. The ASSAF architecture combines two neural network architectures: a denoising autoencoder (DAE) and a Siamese neural network (SNN). To the best of our knowledge, we are the first to combine a DAE and an SNN in a single architecture in order to solve a classification task.
We evaluated ASSAF extensively on the BOSSBase dataset, which contains 10,000 grayscale images. The BOSSBase dataset is popular in steganography research and serves as a common baseline for studies in the steganography and steganalysis domain. We conducted our experiments and evaluation of ASSAF on eight use cases that combine four payload sizes (bpnzAC ratios) and two JPEG image quality factors.
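For clarity, the eight use cases are simply the cross product of the two quality factors and the four embedding rates reported in Tables 5 and 6:

```python
# The eight evaluation use cases: every combination of the two JPEG
# quality factors and the four embedding rates from Tables 5 and 6.
from itertools import product

quality_factors = (75, 95)
bpnzac_rates = (0.1, 0.2, 0.3, 0.4)

use_cases = list(product(quality_factors, bpnzac_rates))
assert len(use_cases) == 8
for qf, rate in use_cases:
    print(f"QF={qf}, bpnzAC={rate}")
```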
Our results demonstrate the superiority of the ASSAF architecture over existing state-of-the-art methods for the detection of J-UNIWARD steganography. ASSAF's detection accuracy is between 0.803 (for the hardest use case: QF = 95 and bpnzAC = 0.1) and 0.993 (for the simplest use case: QF = 75 and bpnzAC = 0.4). ASSAF provides a relative improvement of 6%–40% over the top-performing existing state-of-the-art method in the accurate detection of J-UNIWARD steganography in a JPEG image. For instance, in comparison to SRNet (Boroumand et al., 2019), which achieved a detection accuracy of 0.572 on the BOSSBase dataset for QF = 95 and bpnzAC = 0.1, ASSAF achieved a detection accuracy of 0.803, a relative improvement of 40%. In addition, our architecture is much simpler than current state-of-the-art methods and thus is faster to train, less prone to overfitting, and easier to scale.
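The relative improvement figures above are the ratio between ASSAF's accuracy and that of the best existing method; a worked check of the hardest use case:

```python
# Worked check of the reported relative improvement for the hardest use
# case (QF = 95, bpnzAC = 0.1), using the accuracies from Table 6.
assaf_acc, srnet_acc = 0.803, 0.572
relative_improvement = (assaf_acc - srnet_acc) / srnet_acc
print(f"{relative_improvement:.0%}")   # -> 40%
```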
The issue of the resilience of steganalysis models to additive noise is a controversial topic in the steganalysis domain, as additive noise could be a form of spatial steganography or could be added in order to impair the model's performance. Some spatial steganography techniques (e.g., least significant bit (LSB) matching; Mielikainen, 2006) select the candidate embedding pixels randomly or pseudo-randomly, and the embedding is then performed by changing the pixels' values. Steganography can also be modeled as additive noise within the cover image (Harmsen, 2003), and other studies (e.g., Singh & Shree, 2016) state explicitly that steganography in the JPEG domain is additive noise. In this manner, the properties of several steganography techniques are similar to those of additive noise, which changes pixel values in a random manner. Thus, additive noise can be considered a form of steganography (even if it does not contain a real payload), as the ‘‘noise’’ may contain real information. As a result, additive noise evaluations are not performed in steganalysis research, and to the best of our knowledge, such evaluations have not appeared in any other steganalysis papers. In addition, digital images naturally incorporate noise during image acquisition and processing, such as sensor noise and motion blur (Zeng et al., 2018). It is worth mentioning that images with a higher noise level are better candidates for steganography and might already include secret embedded information, making this sort of discussion a controversial topic in the steganalysis domain. We believe that the current process of evaluating our model addresses this topic by including ‘‘benign’’ images that have some level of noise in both the training and test phases; on these images, our proposed solution provided low FPRs, demonstrating its ability to handle additive noise.
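As an illustration of this point, the following sketch (a deliberate simplification of LSB matching, not the exact scheme of Mielikainen, 2006) shows why such spatial embedding is statistically similar to sparse additive ±1 noise:

```python
# Simplified illustration of why spatial LSB-style embedding behaves like
# additive noise: pseudo-randomly selected pixels receive +/-1 changes.
import numpy as np

rng = np.random.default_rng(42)            # pseudo-random pixel selection
cover = rng.integers(0, 256, size=(512, 512), dtype=np.int16)

embedding_rate = 0.1                       # fraction of pixels modified
mask = rng.random(cover.shape) < embedding_rate
noise = rng.choice(np.array([-1, 1], dtype=np.int16), size=cover.shape)

# The "stego" image equals the cover plus sparse +/-1 changes on the
# selected pixels -- statistically similar to additive noise.
stego = np.clip(cover + mask * noise, 0, 255)
print(f"pixels changed: {(stego != cover).mean():.1%}")
```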
Based on the results achieved in this research, we conclude
that ASSAF has great potential for improving the detection of
adaptive steganography in JPEG images.
7.1. Limitations
ASSAF was trained and tested only on the BOSSBase dataset, which contains grayscale images at a fixed size of 512×512. Thus, the results of this study do not provide an indication of how the proposed method would perform on color images or on images of different sizes. The analysis of color images requires a larger architecture that processes the three image color channels: red, green, and blue (RGB). In addition, in order to cope with a larger image size, the proposed architecture must adapt the input image by cropping or resizing the original image to fit the architecture's image input size, which might impact the steganography ‘‘signature’’ in the image. These changes, therefore, increase the model's complexity and would thus require much more computational resources to train. These limitations should be further explored in future research.
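To illustrate this trade-off, a minimal sketch of the two standard input adaptations follows (the function names are ours, for illustration only): cropping preserves pixel values but discards content, while resizing resamples every pixel and may erase the faint embedding signal.

```python
# Illustrative sketch of the two standard ways to fit a larger image to a
# fixed 512x512 input; neither is used by ASSAF in this study.
from PIL import Image

def center_crop(img: Image.Image, size: int = 512) -> Image.Image:
    # Keeps original pixel values intact but discards content outside
    # the crop window.
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

def resize_to_input(img: Image.Image, size: int = 512) -> Image.Image:
    # Resampling changes every pixel value, acting like a mild
    # compression that may distort the steganography "signature".
    return img.resize((size, size), Image.BILINEAR)
```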
7.2. Future work
In future research, we plan to employ the ASSAF architecture for the detection of other forms of steganography and different image formats. We also plan to investigate the use of a DAE for removing steganography from images. Another possible research vector focuses on generalizing the architecture to support the detection of several payload sizes or various steganography techniques with a single model. As mentioned earlier, we also propose investigating the influence of various image sizes and color channels.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Appendix
See Fig. 12.
Fig. 12. Detailed architecture of the SNN.
References
Ahmed, N., Natarajan, T., & Rao, K. R. (1974). Discrete cosine transform. IEEE
Transactions on Computers, C–23(1), 90–93. http://dx.doi.org/10.1109/T-C.
1974.223784.
Akhtar, N., Johri, P., & Khan, S. (2013). Enhancing the security and quality of
lsb based image steganography. In Proceedings - 5th International conference
on computational intelligence and communication networks (pp. 385–390).
http://dx.doi.org/10.1109/CICN.2013.85.
Bas, Patrick, Filler, T., & Pevný, T. (2011). Break our steganographic system: the
ins and outs of organizing BOSS (pp. 59–70). Berlin, Heidelberg: Springer,
http://dx.doi.org/10.1007/978-3-642-24178-9_5.
Bas, P., & Furon, T. (n.d.). BOWS-2. Retrieved from http://bows2.ec-lille.fr.
Boroumand, M., Chen, M., & Fridrich, J. (2019). Deep residual network for
steganalysis of digital images. IEEE Transactions on Information Forensics and
Security, 14(5), 1181–1193. http://dx.doi.org/10.1109/TIFS.2018.2871749.
Cheddad, A., Condell, J., Curran, K., & Mc Kevitt, P. (2010). Digital image
steganography: Survey and analysis of current methods. Signal Processing,
90(3), 727–752. http://dx.doi.org/10.1016/J.SIGPRO.2009.08.010.
A. Cohen, A. Cohen and N. Nissim / Neural Networks 131 (2020) 64–77
Chen, M., Sedighi, V., Boroumand, M., & Fridrich, J. (2017). JPEG-Phase-aware
convolutional neural network for steganalysis of JPEG images. In Proceedings
of the 5th ACM workshop on information hiding and multimedia security (pp.
75–84). http://dx.doi.org/10.1145/3082031.3083248.
Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proceedings - 2005
IEEE computer society conference on computer vision and pattern recognition.
http://dx.doi.org/10.1109/CVPR.2005.202.
Coatrieux, G., Pan, W., Cuppens-Boulahia, N., Cuppens, F., & Roux, C. (2013). Reversible watermarking based on invariant image classification and dynamic
histogram shifting. IEEE Transactions on Information Forensics and Security,
8(1), 111–120. http://dx.doi.org/10.1109/TIFS.2012.2224108.
Harmsen, J. J. (2003). Steganalysis of additive noise modelable information
hiding. In Proc. SPIE electronic imaging.
Holub, Vojtěch, & Fridrich, J. (2012). Designing steganographic distortion using
directional filters. In WIFS 2012 - Proceedings of the 2012 IEEE international
workshop on information forensics and security (pp. 234–239). http://dx.doi.
org/10.1109/WIFS.2012.6412655.
Holub, Vojtech, & Fridrich, J. (2015). Low-complexity features for JPEG steganalysis using undecimated DCT. IEEE Transactions on Information Forensics and
Security, 10(2), 219–228. http://dx.doi.org/10.1109/TIFS.2014.2364918.
Holub, Vojtěch, Fridrich, J., & Denemark, T. (2014). Universal distortion function
for steganography in an arbitrary domain. Eurasip Journal on Information
Security, http://dx.doi.org/10.1186/1687-417X-2014-1.
Huang, F., Zhong, Y., & Huang, J. (2014). Improved algorithm of edge adaptive
image steganography based on LSB matching revisited algorithm. In LNCS:
vol. 8389, Lecture notes in computer science (including subseries lecture notes
in artificial intelligence and lecture notes in bioinformatics) (pp. 19–31). http:
//dx.doi.org/10.1007/978-3-662-43886-2_2.
Mielikainen, J. (2006). LSB Matching revisited. IEEE Signal Processing Letters, 13(5),
285–287. http://dx.doi.org/10.1109/LSP.2006.870357.
Pevný, T., Filler, T., & Bas, P. (2010). Using high-dimensional image models to
perform highly undetectable steganography. In LNCS: vol. 6387, Lecture notes
in computer science (including subseries lecture notes in artificial intelligence
and lecture notes in bioinformatics) (pp. 161–177). http://dx.doi.org/10.1007/
978-3-642-16435-4_13.
Qin, C., Chang, C. C., Huang, Y. H., & Liao, L. T. (2013). An inpainting-assisted reversible steganographic scheme using a histogram shifting mechanism. IEEE
Transactions on Circuits and Systems for Video Technology, 23(7), 1109–1118.
http://dx.doi.org/10.1109/TCSVT.2012.2224052.
Sharp, T. (2001). An implementation of key-based digital signal steganography. In
Lecture notes in computer science (including subseries lecture notes in artificial
intelligence and lecture notes in bioinformatics), (vol. 2137) (pp. 13–26). Berlin,
Heidelberg: Springer, http://dx.doi.org/10.1007/3-540-45496-9_2.
77
Singh, P., & Shree, R. (2016). A comparative study to noise models and image
restoration techniques. International Journal of Computer Applications, http:
//dx.doi.org/10.5120/ijca2016911336.
Solanki, K., Sarkar, A., & Manjunath, B. S. (2007). YASS: Yet another steganographic scheme that resists blind steganalysis. In LNCS: vol. 4567, Lecture
notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) (pp. 16–31). http://dx.doi.org/10.
1007/978-3-540-77370-2_2.
Tan, S., & Li, B. (2014). Stacked convolutional auto-encoders for steganalysis
of digital images. In 2014 Asia-pacific signal and information processing
association annual summit and conference. http://dx.doi.org/10.1109/APSIPA.
2014.7041565.
Taubman, D. S. (2002). JPEG2000: Image compression fundamentals, standards
and practice. Journal of Electronic Imaging, 11(2), 286. http://dx.doi.org/10.
1117/1.1469618.
Tsai, P., Hu, Y. C., & Yeh, H. L. (2009). Reversible image hiding scheme using
predictive coding and histogram shifting. Signal Processing, 89(6), 1129–1143.
http://dx.doi.org/10.1016/j.sigpro.2008.12.017.
van der Spoel, E., Rozing, M. P., Houwing-Duistermaat, J. J., Eline Slagboom, P.,
Beekman, M., de Craen, A. J. M., et al. (2015). Siamese neural networks
for one-shot image recognition. In ICML - Deep learning workshop. http:
//dx.doi.org/10.1017/CBO9781107415324.004.
Vincent, P. (2008). Extracting features with autoencoders. In Proceedings of the
25th international conference on machine learning (pp. 1096–1103). http:
//dx.doi.org/10.1145/1390156.1390294.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. Journal of Machine Learning Research (JMLR),
11, 3371–3408.
Wei, Wang (2018). New malware takes commands from memes posted on Twitter. Retrieved April 3, 2019, from https://thehackernews.com/2018/12/malware-twitter-meme.html.
Xu, G. (2017). Deep convolutional neural network to detect j-UNIWARD. In
IH and MMSec 2017 - Proceedings of the 2017 ACM workshop on information hiding and multimedia security (pp. 67–73). Association for Computing
Machinery (ACM), http://dx.doi.org/10.1145/3082031.3083236.
Zeng, J., Tan, S., Li, B., & Huang, J. (2018). Large-scale JPEG image steganalysis
using hybrid deep-learning framework. IEEE Transactions on Information
Forensics and Security, 13(5), 1200–1214. http://dx.doi.org/10.1109/TIFS.2017.
2779446.
Zurada, Jacek M. (1992). Introduction to artificial neural systems. http://dx.doi.org/
10.1016/0925-2312(92)90018-k.