Neural Networks 131 (2020) 64–77

ASSAF: Advanced and Slim StegAnalysis Detection Framework for JPEG images based on deep convolutional denoising autoencoder and Siamese networks

Assaf Cohen a,b, Aviad Cohen a, Nir Nissim a,c,∗

a Malware Lab, Cyber Security Research Center, Ben-Gurion University of the Negev, Israel
b Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
c Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel

Article history: Received 4 October 2019; Received in revised form 18 May 2020; Accepted 16 July 2020; Available online 29 July 2020.

Keywords: Steganography; Steganalysis; Deep learning; Autoencoder; Siamese neural network; Convolutional neural network

Abstract

Steganography is the art of embedding a confidential message within a host message. Modern steganography is focused on widely used multimedia file formats, such as images, video files, and Internet protocols. Recently, cyber attackers have begun to include steganography (for communication purposes) in their arsenal of tools for evading detection. Steganalysis is the counter-steganography domain which aims at detecting the existence of steganography within a host file. The presence of steganography in files raises suspicion regarding the file itself, as well as its origin and receiver, and might be an indication of a sophisticated attack. The JPEG file format is one of the most popular image file formats and thus is an attractive and commonly used carrier for steganography embedding. State-of-the-art JPEG steganalysis methods, which are mainly based on neural networks, are limited in their ability to detect sophisticated steganography use cases. In this paper, we propose ASSAF, a novel deep neural network architecture composed of a convolutional denoising autoencoder and a Siamese neural network, specially designed to detect steganography in JPEG images. We focus on detecting the J-UNIWARD method, which is one of the most sophisticated adaptive steganography methods used today. We evaluated our novel architecture using the BOSSBase dataset, which contains 10,000 JPEG images, in eight different use cases which combine different JPEG quality factors and embedding rates (bpnzAC). Our results show that ASSAF can detect steganography with high accuracy rates, outperforming, in all eight use cases, the state-of-the-art steganalysis methods by 6% to 40%. © 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Steganography has become a buzzword lately, achieving notoriety for its use in several malicious activities involving images. Steganography is the art of embedding a secret message or payload within another form of media or communication method without the awareness of a third person or gatekeeper. Invisible ink is an example of a traditional steganography method in which specific lighting is required to detect the secret message, and the original message remains "unchanged". Modern forms of steganography focus on computer files, such as multimedia file formats (images, music files, videos, etc.), in addition to other forms of targeted files like Office files and communication protocols, as a means of embedding messages.

∗ Corresponding author at: Malware Lab, Cyber Security Research Center, Ben-Gurion University of the Negev, Israel. E-mail address: nirni@bgu.ac.il (N. Nissim).
https://doi.org/10.1016/j.neunet.2020.07.022

Recently, a few cyber-attacks have included the use of image steganography,1,2,3 which has several advantages in such attack scenarios. First, it is less detectable. Second, it may confuse the gatekeeper and investigator, as image files arouse less suspicion. Third, it allows the attacker to use legitimate social networks to host steganography images, since images are commonly posted and shared on social networks and thus are considered harmless. Thus, there has been an increase in the use of steganography in the wild (e.g., a recent report of malware that received configuration and commands from its operator via meme images posted on Twitter (Wei, 2018)). Steganalysis is aimed at countering steganography by detecting its existence within a message. While steganalysis does not reveal the secret message itself, since it could be encrypted, it does determine whether a message embedded using steganography is present within a file.

1 https://thehackernews.com/2018/12/malware-twitter-meme.html
2 https://thehackernews.com/2016/12/image-exploit-hacking.html
3 https://securelist.com/steganography-in-contemporary-cyberattacks/79276/

The Joint Photographic Experts Group, or JPEG, is the most common image file format, primarily due to its lossy compression, which results in good image quality with reduced file size. Because of its popularity, JPEG images have become favored carriers for steganography, and several steganography techniques have been developed specifically for JPEG images. J-UNIWARD (Holub, Fridrich, & Denemark, 2014), proposed in 2014, is a sophisticated, adaptive steganography method for JPEG images. Content-adaptive steganography methods take the image's properties into account when embedding the secret payload, embedding the content in different places inside the image depending on those properties; prior to the emergence of adaptive steganography, image steganography embedding was accomplished using a pseudo-random key to determine where the data should be embedded in an image, without considering the differences between images. J-UNIWARD is very sneaky, as it attempts to determine the optimal places within the image to inject the secret payload, so that it leaves the fewest footprints, making it more difficult to detect. This form of adaptive steganography poses challenges to forensics and steganalysts.

In the last decade, machine learning (ML) and deep neural networks (DNN) have evolved at a fast pace, and the use of these models in various domains has increased. This trend can be seen in both the steganography and steganalysis domains, where more and more methods have begun to utilize ML and DNN models. Recent steganalysis research on JPEG images (particularly on detecting the J-UNIWARD method) used DNN models, some of which achieved impressive detection rates (i.e., Xu, 2017; Zeng, Tan, Li, & Huang, 2018). However, the methods proposed have very complex DNN architectures; thus, the training time is long, and the architectures lack flexibility. Unfortunately, the performance of these approaches in extreme conditions (e.g., with a high JPEG quality factor and small embedding rate) deteriorates, with the detection rate dropping to around 65%, leaving much room for improvement.
In this paper, we propose ASSAF, the Advanced and Slim StegAnalysis Detection Framework for JPEG images based on deep convolutional denoising autoencoder and Siamese networks, a novel deep neural network architecture composed of a convolutional denoising autoencoder and a Siamese neural network, for the detection of steganography in JPEG images embedded using J-UNIWARD. ASSAF is much lighter than the architectures proposed in previous research and thus requires much less training time. In addition, to the best of our knowledge, we are the first to combine a denoising autoencoder and a Siamese neural network in one architecture in order to solve a binary classification task.

The major contributions of this paper are:

1. We combine a denoising autoencoder and Siamese neural network and leverage this novel pairing to address a binary classification task. This combination allows us to use both the original input data as well as the processed data as input to the Siamese neural network classifier.
2. We propose ASSAF, a novel deep neural network architecture, which demonstrates that the denoising autoencoder and Siamese neural network combination can efficiently detect J-UNIWARD's adaptive steganography in JPEG images.
3. ASSAF significantly improves the detection capability of J-UNIWARD steganalysis, outperforming existing state-of-the-art methods.

The rest of the paper is organized as follows. In Section 2, we provide the necessary background for our work. In Section 3, we survey relevant related work from our field of research. We present ASSAF in Section 4 and evaluate it in Section 5; the results are presented in Section 6. In Section 7, we discuss and conclude this research.

2. Background

In the following section, we provide the background necessary for the reader to understand this study. We elaborate on several topics that will be covered in this paper, in order to improve understanding of our novel architecture.

2.1. JPEG image file format

JPEG, the most common image file format, uses lossy compression that provides a high compression ratio while preserving good image quality. JPEG file compression is based on several steps, including discrete cosine transformation (DCT), quantization, and encoding (Taubman, 2002). The image quality factor (QF) is one of the parameters of the JPEG file compression process. The QF determines the quality of the compressed image and the amount of compression to be applied. A large QF means that there is less compression and thus less data loss during the compression process; however, a large QF also provides more space for steganography within a JPEG file and a bigger embedding "surface".

2.2. Image steganography

Image steganography is a subdomain of steganography which focuses on embedding a secret message or payload within an image file. Image steganography research has been performed on several file formats, including PNG, BMP, and JPEG. Least significant bit (LSB) steganography is one of the most popular image steganography approaches today. Embedding the payload in the LSB means that the secret payload will be embedded within each pixel in the bit that influences the pixel the least, thus limiting its effect on the original image's visual quality. Extensive research has been conducted on LSB steganography (e.g., Akhtar, Johri, & Khan, 2013; Huang, Zhong, & Huang, 2014; Mielikainen, 2006; Sharp, 2001), a form of steganography that is easy to implement and allows a significant payload size but is also relatively easy to detect.
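To make the LSB approach concrete, the following is a minimal, illustrative Python/NumPy sketch of LSB embedding and extraction on a grayscale image; this sketch is ours and is not taken from any of the cited works, and the helper names are hypothetical.

```python
# A minimal LSB steganography sketch for a grayscale (uint8) image.
import numpy as np

def lsb_embed(cover: np.ndarray, payload_bits: np.ndarray) -> np.ndarray:
    """Embed payload_bits (an array of 0/1 values) into the least
    significant bit of the first len(payload_bits) pixels."""
    stego = cover.copy().ravel()
    n = payload_bits.size
    if n > stego.size:
        raise ValueError("payload larger than cover capacity")
    stego[:n] = (stego[:n] & 0xFE) | payload_bits  # overwrite only the LSB
    return stego.reshape(cover.shape)

def lsb_extract(stego: np.ndarray, n_bits: int) -> np.ndarray:
    """Recover the first n_bits embedded bits."""
    return stego.ravel()[:n_bits] & 1

# Example: hide one byte ("A") in an 8x8 cover block.
cover = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
bits = np.unpackbits(np.frombuffer(b"A", dtype=np.uint8))
stego = lsb_embed(cover, bits)
assert np.array_equal(lsb_extract(stego, 8), bits)
```

Each modified pixel changes by at most one gray level, which is why LSB embedding is visually imperceptible yet, as noted above, statistically easy to detect.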
In contrast, the frequency steganography approach uses various transformation techniques (such as the DCT (Ahmed, Natarajan, & Rao, 1974)) to transform the image into the frequency space and injects the secret payload there. This form of steganography is much more secure and is more difficult to detect; however, it offers less steganography space (i.e., space to embed data). Research (Coatrieux, Pan, Cuppens-Boulahia, Cuppens, & Roux, 2013; Qin, Chang, Huang, & Liao, 2013; Solanki, Sarkar, & Manjunath, 2007; Tsai, Hu, & Yeh, 2009) has shown that this form of steganography is efficient and more secure than the simple LSB approach.

Recently, adaptive steganography, a more sophisticated steganography approach, emerged. Older forms of steganography embedded the secret message in host images in the same manner for every image and did not take the image's properties (such as monotonic areas that might enable the detection of embedded messages) into consideration. In adaptive steganography methods this is not the case, as the methods examine the properties of the image and identify the optimal spots in the image for embedding (e.g., the "noisier" parts of the image are good candidates for embedding, since changes in those areas of the image will be harder to detect). Adaptive steganography methods are more complicated to implement and have been shown to be harder to detect (Cheddad, Condell, Curran, & Mc Kevitt, 2010). One of the most well-known adaptive steganography techniques is the UNIWARD method and its JPEG variant, the J-UNIWARD method (Holub et al., 2014). Other forms of adaptive steganography include WOW (Holub & Fridrich, 2012) and HUGO (Pevný, Filler, & Bas, 2010).

2.3. Image steganalysis

As the use of steganography methods for images has increased, the need to analyze and detect those methods has grown as well. Image steganalysis focuses on revealing the existence of such embedded data in an image, rather than on revealing the actual content of the embedded payload. Over the years, several steganalysis techniques have been studied, and currently the trend in steganalysis research is to use machine learning and deep neural network algorithms.

2.4. Deep neural networks

An artificial neural network (ANN) is a computing system inspired by the biological human brain (Zurada, 1992). Neural network architecture is composed of several neurons, each of which includes an activation function that "fires" that neuron, mimicking the neuron function in the human brain. The activation function of a neuron (also called its nonlinear property) allows it to describe simple nonlinear functions, and the combination of several neurons provides the ability to describe more complex and sophisticated functions. Neural networks are composed of input, output, and hidden layers; each layer is composed of several neurons that are connected to neurons in previous layers, but usually there are no connections between neurons in the same layer. A deep neural network (DNN), or deep learning (DL), is a neural network with more than two hidden layers. These networks are able to learn very complex functions or situations. Deep learning architecture is widely used, and a variety of applications in different fields are based on this architecture (e.g., image processing, autonomous driving).
Such technological advancements have resulted in a significant increase in the development of steganalysis methods based on deep learning architectures, which are usually composed of convolutional neural networks (CNNs). A CNN is a form of deep learning that applies convolutions of filters to the image in order to perform some sort of processing of the input image. The filters of the CNN are learned during the network's training process in order to find the best set of filters for a specific task. CNNs are commonly used in the image and video processing domains and thus are also popular in the steganalysis domain.

2.4.1. Denoising Autoencoder (DAE)

An autoencoder (AE) is a neural network architecture that is composed of two parts, an encoder and a decoder. The encoder transforms the input data (e.g., an image) into a latent representation which is smaller than the original input (thus also performing compression), identifying the meaningful features of the input. The decoder transforms the latent representation back into the original data. Therefore, the autoencoder performs a compression–decompression process. Denoising autoencoders (DAEs) use the same encoder–decoder architecture. The DAE's main purpose is to denoise the input. The main difference between an AE and a DAE is seen in the training phase: in the case of a DAE, the input image contains some noise, and the feedback image provided to the DAE is the same image without the noise. This sort of training process allows the DAE to learn a latent representation of an image without the addition of noise, so that it is able to remediate the noise that was injected into the input image. Examples of the use of DAEs can be found in Vincent (2008) and Vincent, Larochelle, Lajoie, Bengio, and Manzagol (2010).

2.4.2. Siamese Neural Network (SNN)

A Siamese neural network is a form of neural network (NN) that receives two inputs (instead of the usual single input) and performs a classification. A Siamese neural network is composed of two identical legs that embed the input images, followed by one or more classifier layers. Both legs are identical in that their weights and architecture are the same, and thus each leg performs the same embedding process on its given input image. After the embedding process, the classifier layers perform the classification task based on the distance or similarity of the two input images' embeddings. This form of architecture is usually used for one-shot classification of images (van der Spoel et al., 2015) (i.e., training an NN classifier with a small number of samples per class) or face recognition (Chopra, Hadsell, & LeCun, 2005). An example of Siamese neural network architecture is provided in Fig. 4.

3. Related work

Universal wavelet relative distortion, or UNIWARD (Holub et al., 2014), is an adaptive steganography method that has become quite popular since it was proposed by Holub et al. in 2014. UNIWARD was designed to evade detection by embedding the secret payload while maintaining the distortion distribution of the original image. UNIWARD uses directional filters (horizontal, vertical, and diagonal) to determine the amount of distortion in the image in order to identify the areas in the image which have more noise; these areas will then be candidates for embedding the secret information.
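The directional-filtering idea can be sketched in a few lines of Python. The snippet below is a simplified illustration only: it uses toy high-pass kernels to build a "texture map" of candidate embedding regions, whereas UNIWARD's actual directional filters are derived from a wavelet filter bank and feed a relative distortion measure.

```python
# Illustrative sketch: directional residuals as a proxy for "noisiness".
import numpy as np
from scipy.signal import convolve2d

KERNELS = {
    "horizontal": np.array([[-1.0, 1.0]]),
    "vertical": np.array([[-1.0], [1.0]]),
    "diagonal": np.array([[-1.0, 0.0], [0.0, 1.0]]),
}

def texture_map(image: np.ndarray) -> np.ndarray:
    """Sum of absolute directional residuals; larger values mark noisier
    (textured) regions, where embedding changes are harder to detect."""
    img = image.astype(float)
    return sum(
        np.abs(convolve2d(img, k, mode="same", boundary="symm"))
        for k in KERNELS.values()
    )
```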
The UNIWARD method aims to minimize the distortion changes that may occur when the secret payload is embedded into the image. UNIWARD determines the amount of data that can be injected into the image by the embedding rate parameter, measured in bits per nonzero AC DCT coefficient (bpnzAC) (Taubman, 2002). As the bpnzAC increases, the capacity of embedded data increases, and thus it is easier to detect the presence of steganography (for example, at a bpnzAC of 0.4, an image with 100,000 nonzero AC DCT coefficients can carry a payload of 40,000 bits, i.e., 5 KB). J-UNIWARD is the variant of the UNIWARD method for JPEG images. In comparison to other adaptive JPEG steganography methods, J-UNIWARD showed good evasion performance with a reasonably sized payload. In addition, J-UNIWARD is quite popular and is the sneakiest JPEG adaptive steganography method (Zeng et al., 2018). Therefore, we primarily focus on J-UNIWARD steganalysis in this paper.

Fig. 1 demonstrates steganography applied to an image using J-UNIWARD: (a) the original image, (b) the stego-image embedded with a payload using J-UNIWARD, (c) the difference between the original image and the image embedded with a payload using J-UNIWARD; the color is brighter as the difference is greater (black indicates no difference, and white indicates the areas with the greatest difference). The difference between the images shown in Fig. 1(c) reflects the changes resulting from embedding the secret payload. As can be seen, the secret message is not evenly distributed across the entire image. In addition, noisier areas in the image (like edges, which in this case are the rocks in the image) contain more changes than homogenous areas.

Fig. 1. The difference between images with and without a J-UNIWARD steganography payload, with a QF of 75 and a bpnzAC of 0.4: (a) the original image, (b) the stego-image embedded with a payload using J-UNIWARD, (c) the difference between images a and b; black indicates no difference, and white indicates the areas with the greatest difference. Source: The image is from the BOSSBase (Bas, Filler, & Pevný, 2011) dataset.

Since the research introducing J-UNIWARD was published, several studies have been performed in order to develop efficient steganalysis techniques aimed at this state-of-the-art steganography method. There are two major approaches: designing handcrafted features for steganography detection and the use of deep learning methods. Recently, DL methods have gained popularity in a large variety of domains, including the steganalysis domain. The main advantage of DL is its flexibility. In contrast to other architectures that needed customized handcrafted features as input, deep learning architectures receive the raw input data and identify the relevant features for the desired task during the training phase. Therefore, DL architectures have the potential to improve the steganography detection rate by learning the characteristics of images with embedded payloads and non-steganographic images. One of the first approaches that included an AE was proposed by Tan and Li (2014) in 2014. In their method, several layers of convolutional autoencoders were stacked in order to detect steganography within images (not just JPEGs in particular). Unfortunately, this novel method was unable to outperform existing steganalysis techniques. State-of-the-art methods that use deep learning for the detection of J-UNIWARD steganography in JPEG images are described below.
In 2017, Chen, Sedighi, Boroumand, and Fridrich (2017) proposed a DL JPEG steganalysis method which uses a convolutional neural network. Two network configurations were proposed: PNet and VNet. The high-level architecture of the method includes five groups of convolutional layers with batch normalization (BN) and TanH/ReLU activation functions; some of the groups also include average pooling. After the five groups there is a fully connected layer, which is followed by a softmax layer that performs the classification. The main difference between VNet and PNet is that PNet is composed of several parallel convolution layers (which is not a very common DL architecture), while VNet has no parallel layers but has more filters per layer than PNet. For example, one of the layers of PNet consists of 64 different parallel convolutional layers, each of which includes 64 filters (4,096 convolutional filters in total); the equivalent layer in VNet has 1,024 filters. The large number of layers in both PNet and VNet results in a very complex network with many parameters to train. This architecture improved upon the performance of existing methods that used handcrafted features; however, it is much more complex and thus takes a lot of time to train and maintain.

Another DL architecture, proposed by Zeng et al. (Zeng-Net) (Zeng et al., 2018) in 2018, includes several convolution layers in several subnets, in addition to a large fully connected layer. Since the proposed architecture is so massive and complex, the authors had to train it on a relatively large dataset containing at least 50K images, in contrast to other research that used a dataset of around 10K images. Their results also outperformed the detection rates of handcrafted feature-based methods but left room for improvement at low bpnzAC ratios.

Additional research in JPEG steganalysis was performed by Xu (2017), who proposed a DNN architecture (also referred to as J-XuNet) composed of 20 convolutional layers. In this method, the first layer is a DCT layer that applies 16 DCT filters in order to preprocess the input image, as previous research (Holub & Fridrich, 2015) found that doing so improves the detection rate of the network. This study demonstrated the best performance at high bpnzAC ratios; however, the proposed method suffers at low bpnzAC ratios, providing accuracy close to that of random choice. In addition, like other prior work mentioned above, it is very complex and takes a long time to train and converge (e.g., this model took nearly 300 epochs to converge on the BOSSBase dataset).

SRNet (Boroumand, Chen, & Fridrich, 2019), a new state-of-the-art method, was proposed by Boroumand et al. in 2019. SRNet is composed of 12 convolution layers organized in four different groups and uses a combination of batch normalization, residual links, and average pooling. In contrast to prior work such as Holub and Fridrich (2015) and Xu (2017), SRNet did not use known filters or perform DCT transformation preprocessing on the input images. Experimental results showed better performance than other existing state-of-the-art methods. Nevertheless, this method is still very complex and includes a large number of layers and parameters, and it thus takes a long time to train (e.g., 2.5 days on a 20K image dataset). In addition, the performance of this method in challenging steganography use cases is relatively low, with a detection rate approaching 70%.
We believe that it is possible to achieve improved steganography detection by using the novel deep neural network architecture proposed in this paper, leveraging both a DAE and a Siamese neural network.

4. ASSAF's deep neural network architecture

Our proposed architecture for the efficient detection of steganography within JPEG images embedded using J-UNIWARD is composed of two phases. The first phase uses a denoising autoencoder which performs preprocessing on the input image. In the second phase, we use a Siamese neural network classifier in order to determine whether the original image contains steganography; the Siamese neural network does this by measuring the distance between the original input image and the DAE-processed image. The rationale behind the proposed architecture is that the preprocessing performed on the image can be learned by the DAE during training. This is in contrast to the use of fixed known filters, as was done in previous work such as Holub and Fridrich (2015) and Xu (2017). Furthermore, our proposed architecture does not identify the steganography artifacts directly like previous work such as Boroumand et al. (2019) and Xu (2017); instead, our approach learns the meaningful distance between the representations of the two images. We are the first to use a Siamese neural network in the steganalysis domain, making our approach revolutionary and novel. In this section, we provide a detailed description of the architecture's two phases. Fig. 2 illustrates the framework; as can be seen, phase 1 is composed of the DAE, and phase 2 includes the Siamese neural network and the final classification, determining whether the input image contains a steganographic payload or not.

Fig. 2. High-level architecture of the proposed method.

4.1. Denoising Autoencoder (DAE)

The major role of the DAE in the ASSAF method's architecture is to perform preprocessing on the input image which will emphasize the existence of steganography within the input image. The DAE attempts to "correct" some of the noise that was injected into the image during the embedding process. In so doing, it makes a distinction between images in which steganography exists and other images in which steganography does not exist (referred to here as non-steganographic images), allowing the Siamese neural network to identify those subtle changes and make the correct classification. Previously proposed methods like Xu's and others (Holub & Fridrich, 2015; Xu, 2017) used predefined DCT filters in order to preprocess the image. In our framework, we trained a DAE to learn the relevant filters for the preprocessing. This form of preprocessing is crucial for the Siamese neural network to correctly identify the presence or lack of steganography within the input image.

The DAE is composed of two parts, an encoder and a decoder. The input for the DAE is a 512X512X1 pixel image. The encoder has three convolution layers which embed the input image into a latent representation that is smaller than the original image by a factor of 3.2. The decoder receives the latent representation of the image as input and decodes it to the original size. The decoder uses three transposed convolution layers with the same properties as the respective convolution layers of the encoder. Fig. 3 presents the DAE architecture. In total, the DAE contains six layers with 23,229 parameters. The encoder contains the following convolution layers.
The first layer is composed of four filters with a seven by seven kernel size (a.k.a. filter size). The second layer has 10 filters and a five by five kernel size. The third and last layer of the encoder has 20 filters with a three by three kernel size. In addition, each of the layers has a stride of two and one-pixel padding. After the latent representation has been created by the encoder, the decoder performs the decompression task with the following transposed convolution layers. The first layer of the decoder has properties similar to those of the last layer of the encoder: it has 20 filters with a three by three kernel size. The second layer has 10 filters and a five by five kernel size. The last layer of the decoder has four filters with a seven by seven kernel (similar to the first layer of the encoder). Just like in the encoder, each of the layers in the decoder has a stride of two and one-pixel padding. The output layer is a single-filter transposed convolution layer with a one by one kernel. The output of the DAE is a 512X512X1 image, the same dimensions as the input image.

The DAE is trained by providing images in which steganography is present as input; the same image without the presence of steganography is the expected output. This form of training enables the DAE to identify the subtle changes to the image resulting from the steganography process; essentially, the DAE tries to "denoise" those changes. Once the DAE has been trained, each of ASSAF's input images is processed by the trained DAE. Later, in the architecture's second phase, both the DAE-processed image and the original input image serve as the input of the Siamese neural network.

4.2. The Siamese Neural Network (SNN)

The second phase of our architecture contains a Siamese neural network which is responsible for the classification of the JPEG images. The input for the SNN is the original image (which may or may not be embedded with steganography) and the DAE-processed image. The SNN embeds the two input images in order to identify a meaningful distance between the representations of those two images. The representations are made by the SNN legs. The SNN legs are identical: they have the same network architecture and model weights. Thus, the embedding process performed on both input images is the same (and is learned during the training process). Each of the SNN legs is composed of the following convolution layers. The first convolution layer has 16 filters with a kernel size of seven by seven, one-pixel padding, and a stride of two. After the convolution layer, there is a batch normalization layer. The second convolution layer has 32 filters with a kernel size of five by five, one-pixel padding, and a stride of one. After this layer there is a batch normalization layer followed by 1D max pooling with a stride of two. The third convolution layer contains 64 filters with a kernel size of three by three, a stride of one, and one-pixel padding. This is followed by a batch normalization layer and 1D max pooling with a stride of two. The fourth and final convolution layer contains 16 filters with a one by one kernel size, a stride of one, and one-pixel padding. This layer performs dimensionality reduction in order to decrease the number of parameters needed and prevent model overfitting. At the end of each leg there is a flattening layer.
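Before turning to the SNN's classification head, here is a minimal Keras sketch of the phase-1 DAE described in Section 4.1. The layer counts and kernel sizes follow the text; the activation functions, "same" padding, and loss configuration are our assumptions where the paper is not explicit (the paper tracks both a loss of about 0.485 and an MSE of about 0.01, which is consistent with a pixel-wise cross-entropy loss on images scaled to [0, 1] alongside an MSE metric), so the parameter count will not necessarily match the reported 23,229 exactly.

```python
# A sketch of the DAE: three strided convolutions down, three transposed
# convolutions up, and a 1x1 transposed-convolution output layer.
from tensorflow.keras import layers, models

def build_dae() -> models.Model:
    inp = layers.Input(shape=(512, 512, 1))
    # Encoder: 512x512x1 -> 64x64x20; 262,144 / 81,920 = 3.2, matching the
    # "smaller by a factor of 3.2" latent representation in the text.
    x = layers.Conv2D(4, 7, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(10, 5, strides=2, padding="same", activation="relu")(x)
    latent = layers.Conv2D(20, 3, strides=2, padding="same", activation="relu")(x)
    # Decoder: mirror of the encoder, restoring the 512x512 resolution.
    x = layers.Conv2DTranspose(20, 3, strides=2, padding="same", activation="relu")(latent)
    x = layers.Conv2DTranspose(10, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(4, 7, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(1, 1, padding="same", activation="sigmoid")(x)
    return models.Model(inp, out, name="dae")

dae = build_dae()
dae.compile(optimizer="adam", loss="binary_crossentropy", metrics=["mse"])
# Training pairs stego images (inputs, scaled to [0, 1]) with their cover
# versions (targets), so the network learns to "denoise" embedding changes:
# dae.fit(stego_images, cover_images, epochs=250, batch_size=16)
```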
The classification component of the SNN is composed of a distance layer which calculates the distance between the outputs of both legs' flattening layers; the distance is the absolute difference between the embedding representation vectors produced by the two legs. The distance layer is followed by a 100-neuron fully connected layer with a ReLU activation function and a 0.2 dropout, and finally a sigmoid neuron which performs the final classification. The Siamese neural network's general architecture is presented in Fig. 4. A more detailed illustration of the architecture is presented in Fig. 12 in the Appendix.

Fig. 3. The denoising autoencoder architecture. Source: The image is from the BOSSBase (Bas et al., 2011) dataset.

Fig. 4. The architecture of the Siamese neural network. Source: The image is from the BOSSBase (Bas et al., 2011) dataset.

Fig. 5. Data division into subsets used to train the DAE and SNN, and validate the SNN.

Some aspects of this architecture were inspired by insights from Xu's work (Xu, 2017). In Xu's best performing network, only convolution layers were used (max pooling layers were not used). We found that using only convolution layers throughout our entire model did not improve its performance; however, applying this approach (i.e., convolution without pooling) in the first layer did improve the model's performance. Note that the DAE must be fully trained prior to training the SNN, in order to be able to provide the DAE-processed image as the second input of the SNN. During the training phase, we provide the SNN with the correct classification (whether or not the input image contains steganography) as feedback. In this way, the SNN can determine the relevant representation of the images for the classification task.

5. Evaluation

In this section, we evaluate ASSAF and compare it to existing state-of-the-art methods. We begin by presenting the data collection used for evaluation and then discuss the evaluation metrics used.

5.1. Data collection and preparation

In our experiments we used the BOSSBase dataset (Bas et al., 2011), the most popular dataset for steganography. Since its publication in 2011, it has appeared in almost every respected steganography and steganalysis research paper. This dataset includes 10,000 non-compressed, grayscale, 512X512 images. We processed the images as follows: (1) we converted the images to the JPEG file format at two QFs, 95 and 75; (2) we applied the J-UNIWARD steganography method at four bpnzAC levels for each of the QFs (0.1, 0.2, 0.3, and 0.4), thus creating eight different use cases. Each of the use cases was then divided randomly into the various configurations of training and evaluation sets used in our experiments (described later in the paper).

5.2. Research questions

Our experiments are aimed at answering the following research questions:

1. Is the ASSAF deep learning architecture able to detect J-UNIWARD steganography in JPEG images?
2. Does ASSAF provide better detection results than the current state-of-the-art methods?

5.3. Evaluation metrics

In our experiments, we use the following evaluation metrics. The accuracy, presented in Eq. (1), assesses our model's ability to correctly label (classify) the images.
The true positive (TP) count is the number of steganography images that were correctly labeled as containing steganography, and the true negative (TN) count is the number of non-steganography images that were correctly labeled as not containing steganography. The true positive rate (TPR), presented in Eq. (2), is the rate of correctly labeling the positive class. The false positive rate (FPR), presented in Eq. (3), is sometimes referred to as the rate of false alarms; it is calculated by dividing the number of false positives (FP, the instances that the model incorrectly classified as positive) by the total number of negative samples (the images that do not contain steganography).

Accuracy = (TP + TN) / Number of samples   (1)

TPR = TP / Number of positive samples   (2)

FPR = FP / Number of negative samples   (3)

The AUC is the area under the receiver operating characteristic (ROC) curve. The ROC curve shows the tradeoff between the TPR and FPR across different threshold values. The AUC score is a commonly used metric for binary classifier performance in the machine learning domain.

5.4. Experimental design

In this section, we describe the experiments conducted in order to evaluate ASSAF, our proposed DNN architecture for J-UNIWARD steganography detection. The ASSAF architecture differs from the "standard" deep neural network architecture, as it is composed of two smaller neural networks, and thus requires a modified approach for training and evaluation. Each of the NNs in our model must be trained in a different manner with a different set of images in order to avoid feeding poisoned or already "seen" images between the two NNs. In order to tackle this situation, we had to plan our evaluation carefully. The following subsections provide a detailed description of how ASSAF was trained and evaluated (note that this was done using the BOSSBase dataset on the eight use cases presented in Section 5.1; each use case is referred to as an experiment). For each use case we randomly split the dataset into subsets: (1) a training set for the DAE, (2) a training set for the SNN, and (3) a validation set for the final classification using the SNN. The random split ensures that ASSAF performs well in various scenarios and prevents biases and variance that might stem from the data split. Fig. 5 illustrates the data division into subsets. We conducted the experiments using the Python programming language and Keras,4 an open source, high-level neural network library.

4 https://keras.io/

5.4.1. Training ASSAF

Training the DAE

In order to train the DAE, we first randomly selected 5K images from the BOSSBase dataset. We embedded them with steganography using J-UNIWARD, creating 5K pairs of images (meaning 10K images, 5K of which contain steganography and 5K without). During the DAE training, we use the 5K steganography images as input and provide the matching original images (without the steganography) as feedback. After the DAE is trained, we use it to provide the processed input for the SNN training step. During the training process, we monitored the DAE loss and mean squared error (MSE) values in each epoch. We observed that when the loss value is less than 0.485 and the MSE value is 0.01 or less, the performance of the overall architecture improves.
In the case of higher MSE and loss values (e.g., caused by shorter training periods), the SNN cannot effectively differentiate between the steganography and non-steganography images, and this negatively impacts the performance of the overall architecture. We conducted a preliminary experiment in order to determine the number of images to use to train the DAE and found that it is better to use at least 5K images; the DAE training process is less efficient when fewer images are used, compromising the SNN's classification capabilities. We also found that using additional images when training the DAE did not significantly improve the loss value. For example, when training on all 10K images of the BOSSBase dataset, the loss value plateaued at around 0.47. Fig. 6 presents the DAE loss during training. As can be seen, the loss decreases over the epochs and reaches a plateau at a loss of approximately 0.484. In order to achieve the desired loss and MSE values mentioned earlier, the DAE needs about 200–250 epochs of training; further training will not improve the results dramatically and might cause overfitting.

Fig. 6. Loss during DAE training (with QF 75 and bpnzAC 0.4).

Training the SNN

From the remaining 5K images of the 10K image BOSSBase dataset (those which were not used to train the DAE), we randomly selected 4K images in order to train the SNN. We embedded them with steganography using J-UNIWARD, creating 4K pairs of images (meaning 8K images, 4K of which contain steganography and 4K without). We then passed these pairs of images through the trained DAE, creating an additional 4K pairs of images. In total, we have 16K images divided into four groups, each of which contains 4K images. The first group contains the 4K original images; the second group contains the 4K original images embedded with steganography; the third group contains the 4K images of the first group after processing by the trained DAE; and the fourth group contains the 4K images of the second group after processing by the trained DAE. We trained the SNN using 8K pairs of instances of the same image (16K images in total, as mentioned earlier); the first instance of each pair is the image that was not processed by the DAE (groups 1 and 2), and the second instance is the processed image (groups 3 and 4); the pairs were fed as input into the two legs of the SNN. Table 1 describes the four groups of images.

Fig. 7. The Siamese neural network's training process: (a) an input image (before the DAE), taken from a mixed steganography/non-steganography image pool; (b) the second input image, which is image (a) after processing by the DAE; (c) the SNN leg; both legs are the same and hence have the same weights and architecture; (d) the SNN classification layers; (e) the output of the SNN is a classification. Source: The images are from the BOSSBase (Bas et al., 2011) dataset.

Fig. 8. The architecture flow for the evaluation phase: (a) an input image from the validation set, a mixed steganography/non-steganography image pool; (b) the trained DAE, which performs processing on the input image; (c) the second input image, which is image (a) after processing by the DAE; (d) the trained SNN legs; both legs are the same and hence have the same weights and architecture; (e) the trained SNN classifier; (f) the output of the SNN is a classification. Source: The images are from the BOSSBase (Bas et al., 2011) dataset.

Table 1
Description of the groups of images used.

Group  Size  Contains steg.  Processed with DAE  Used to train
1      4K    No              No                  SNN 1st leg
2      4K    Yes             No                  SNN 1st leg
3      4K    No              Yes                 SNN 2nd leg
4      4K    Yes             Yes                 SNN 2nd leg
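Given these inputs, below is a minimal Keras sketch of the Siamese classifier of Section 4.2, with one shared leg applied to both inputs and the distance-based head described earlier. The pooling configuration (2D pooling is used here where the text says "1D max pooling") and the hidden activations are our assumptions. Notably, with these shapes the fully connected layer alone accounts for roughly 6.5M parameters, in line with the overall model size reported later in Section 6.1.

```python
# A sketch of the SNN: one shared "leg" model (weight sharing comes from
# reusing the same instance), an absolute-difference distance layer, and a
# small fully connected classification head.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_leg() -> models.Model:
    inp = layers.Input(shape=(512, 512, 1))
    x = layers.Conv2D(16, 7, strides=2, padding="same")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, 5, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.Conv2D(64, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.Conv2D(16, 1, strides=1)(x)  # 1x1 conv: dimensionality reduction
    return models.Model(inp, layers.Flatten()(x), name="leg")

def build_snn() -> models.Model:
    original = layers.Input(shape=(512, 512, 1))
    denoised = layers.Input(shape=(512, 512, 1))
    leg = build_leg()  # a single instance, so both inputs share weights
    a, b = leg(original), leg(denoised)
    dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([a, b])
    x = layers.Dense(100, activation="relu")(dist)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return models.Model([original, denoised], out, name="snn")

snn = build_snn()
snn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```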
The feedback given to the SNN during training is zero if the original image does not contain steganography and one if it does. Fig. 7 illustrates the SNN training process. The Siamese neural network training process, as in most DL methods, is performed with minibatches. We conducted a preliminary experiment in order to find the best training configuration for the SNN. We found that the most effective learning is achieved when the same image appears twice in a minibatch, once with steganography and once without steganography. This configuration helps the model learn only the relevant "signal" of the steganography and prevents it from being fooled by the visualization of the image itself. Training the Siamese neural network using a different configuration degrades the detection performance by at least 10%.

5.5. Evaluation of ASSAF

The remaining 1K original images (out of the database's 10K images) were then embedded with steganography using J-UNIWARD, resulting in a total of 2K images in 1K pairs: 1K images that contain steganography and 1K that do not (again, the same pairs as before). Neither the SNN nor the DAE has seen these 2K images during the training process, and thus they are used as the validation set for ASSAF. Note that the validation set contains pairs of the same image (one image without steganography, and the same image with steganography); this is in contrast to validation sets typically used, in which all of the images are visually different. This is a difficult test, as the model must not be confused by the image's visual content when attempting to correctly classify the images.

Basic Evaluation

We conducted the evaluation process as follows. We randomly chose images from the 2K image pool and processed each with the trained DAE, receiving the processed version of the image as output. Next, we fed the SNN two inputs: the original image and the processed image. Finally, the classification output of the SNN determines whether the input image contains a J-UNIWARD steganography payload. Fig. 8 illustrates the evaluation process explained above. As mentioned before, this evaluation was done for each of the eight use cases presented in Section 5.1.
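To tie the training pieces together, the minibatch structure described in Section 5.4.1 (each image appearing twice per batch, once with and once without steganography) could be assembled as in the following sketch; the array names are hypothetical placeholders for the four image groups of Table 1.

```python
# Sketch of minibatch construction for the SNN: every selected image
# contributes a stego example (label 1) and a cover example (label 0).
import numpy as np

def make_batch(cover, stego, cover_dn, stego_dn, idx):
    """cover/stego: original images; cover_dn/stego_dn: their DAE-denoised
    versions (Table 1, groups 1-4); idx: image indices for this batch."""
    x_original = np.concatenate([stego[idx], cover[idx]])        # 1st leg
    x_denoised = np.concatenate([stego_dn[idx], cover_dn[idx]])  # 2nd leg
    y = np.concatenate([np.ones(len(idx)), np.zeros(len(idx))])
    return [x_original, x_denoised], y
```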
Evaluating the Robustness of ASSAF against Different QF and bpnzAC Parameters

In addition to the basic evaluation described above, we performed an additional experiment to evaluate the robustness of the proposed architecture. In this experiment, our goal was to thoroughly evaluate our proposed architecture and assess its robustness when different QF and bpnzAC parameters are used in the training and test phases. In order to do so, we used eight different combinations of QF and bpnzAC parameters, based on two QF levels (75 and 95); for each QF we used four different bpnzAC levels (0.1, 0.2, 0.3, 0.4), thus examining a total of eight different combinations. For each one of the eight QF and bpnzAC combinations, we trained the architecture and tested it against the other combinations of QF and bpnzAC. In order to ensure a fair comparison during the cross-validation process, the training and test sets used were randomly selected and remained the same throughout the whole experiment.

6. Results

In this section, we provide the results of our comprehensive evaluation performed on eight different use cases and compare the evaluation results with the performance of current state-of-the-art JPEG steganalysis methods. The results of the eight experiments are presented in Table 2. Recall that each experiment is conducted on a unique combination of the bpnzAC and quality factor.

Table 2
The validation results of the ASSAF architecture in our experiments.

QF 75:
bpnzAC      0.1     0.2     0.3     0.4
TPR         0.921   0.985   0.994   0.993
FPR         0.097   0.021   0.009   0.006
Accuracy    0.912   0.982   0.9925  0.9935
AUC         0.9691  0.9981  0.9995  0.9997

QF 95:
bpnzAC      0.1     0.2     0.3     0.4
TPR         0.765   0.809   0.899   0.975
FPR         0.158   0.108   0.15    0.051
Accuracy    0.8035  0.8505  0.8745  0.962
AUC         0.8766  0.9332  0.9513  0.9931

The results show that, as anticipated, as the bpnzAC increases and the QF decreases, the accuracy, AUC, and TPR improve, while the FPR decreases. It can also be observed that the model's architecture achieves what is likely its maximal detection accuracy of 0.993 with the combination of QF 75 and the higher bpnzAC ratios. In the extreme case of QF 95 and a bpnzAC of 0.1, the model struggles and obtains a higher FPR and a lower accuracy of 0.803, but it still outperforms existing state-of-the-art methods, as we discuss later on.

Table 3 presents an example of an image with J-UNIWARD steganography and without steganography, providing a comparison between the original input image and the DAE output image, as well as the difference between the two images for each case. As can be seen, the DAE processing did not impact the image's visual quality. Furthermore, a closer look at the images shows small differences between the steganography and non-steganography images; we hypothesize that these small differences will be used by (or will enable) the SNN to differentiate between images that contain steganography and those that do not.

Table 3
The difference between the input and output images of the DAE. The difference between the images is greater as the color is brighter (white indicates the greatest difference, while black means no difference). (The image is from the BOSSBase (Bas et al., 2011) dataset.)

Fig. 9. A comparison between the original images and the denoised stego-images processed by the DAE (with a QF of 75 and a bpnzAC of 0.4), with and without a J-UNIWARD steganography payload: (a) and (d) the original image; (b) the stego-image embedded with a payload using J-UNIWARD; (c) the difference between images a and b (black indicates no difference, and white indicates the areas with the greatest difference); (e) the denoised stego-image processed by the trained DAE; (f) the difference between images d and e (black indicates no difference, and white indicates the areas with the greatest difference). Source: The image is from the BOSSBase (Bas et al., 2011) dataset.

Fig. 9 provides a comparison between the original image and the stego-image (embedded with a payload using J-UNIWARD and without an embedded payload) and illustrates the difference between these images. As can be seen from this comparison, the DAE denoises the noisier areas where it suspects the J-UNIWARD method will embed the payload, such as the lighthouse itself; however, as seen in the images showing the difference (Fig. 9(c) and (f)), J-UNIWARD did not embed much information within the lighthouse. This shows that the DAE does not transform the stego-image back to its original form but instead tries to fix all of the possible locations within the image that might be populated with the J-UNIWARD payload.
Therefore, the output of the DAE is a preprocessed form of the input image that contains subtle changes, especially in the noisier areas of the image, which are usually occupied by the J-UNIWARD steganography payload. The DAE preprocesses the image, an operation that subsequently assists the SNN in the classification task. In the steganalysis domain it is quite common to preprocess the images in order to improve the steganalysis model's performance (Holub & Fridrich, 2015; Xu, 2017). Several studies that employed CNNs used a fixed set of filters which was empirically found to improve the model's performance (Chen et al., 2017). However, in our study, instead of using the commonly used fixed filters, we learned the preprocessing using an NN. We found that the DAE performs this task efficiently, especially when later combined with the SNN.

The process of training the SNN is relatively quick; the network usually converges in less than 100 epochs of training, which takes around two to three hours on a proper GPU. Fig. 10 presents the model's accuracy during the training of the SNN for a bpnzAC of 0.4 at each QF evaluated (QF 95 and QF 75). In each epoch of the model's training there are 160 weight update iterations. As can be seen, the accuracy for QF 75 is "smoother" than that for QF 95; this is to be expected, because QF 95 is the more difficult use case for the model. It is also quite clear that the model converges within a few epochs. With QF 75 there is no difference between the training accuracy and the validation accuracy, showing that there is no model overfitting, while this is not the case with QF 95 (e.g., around the thirtieth epoch the QF 95 model overfits, but it later recovers).

Table 4 presents the results of our evaluation of ASSAF's robustness in terms of detection accuracy. The table presents the results in the form of a heat map, where red symbolizes lower detection rates and green represents higher detection rates. The diagonal of the table contains the results of the ASSAF instances that were trained and tested on the same QF and bpnzAC setting. In our study, we define robustness as the model's ability to effectively detect steganography in images that were embedded with bpnzAC and QF levels different from those the model was trained on. The table shows that models with lower bpnzAC levels are more robust than models with higher bpnzAC levels. In addition, there was a major difference in the results when using different QF models; models trained on QF = 95 tend to perform better than models trained on QF = 75, as can be seen in the rightmost column (AVG) of Table 4. It also seems that in some cases the models with lower bpnzAC levels perform slightly better than the models that were trained and tested on the same bpnzAC level. For example, the model trained on QF 75 and bpnzAC 0.1 outperformed the model trained on QF 75 and bpnzAC 0.2 by 0.6% in terms of detection accuracy. In practice, training the architecture with QF = 95 and a bpnzAC value of 0.1 was more likely to detect a steganography payload with a different combination of bpnzAC and QF.
An additional insight obtained from this evaluation is that the connection between the SNN and the DAE it was trained with is crucial, as mixing in images that were denoised with a different DAE than the one the SNN was trained on resulted in a major degradation in performance (a decrease of more than 30% in detection accuracy).

6.1. Comparison to previous research

In this section, we compare ASSAF to the following state-of-the-art methods in the J-UNIWARD steganalysis domain: Zeng-Net (Zeng et al., 2018), J-XuNet (Xu, 2017), SRNet (Boroumand et al., 2019), and VNet and PNet (Chen et al., 2017). Tables 5 and 6 present a comparison of the validation detection accuracy of ASSAF and the abovementioned methods for the eight use cases evaluated. As can be seen, ASSAF outperforms the current state-of-the-art methods in all of the use cases. In the tables we also present the percent of improvement (in parentheses) obtained by ASSAF over the best performing existing method in each use case. It can be seen that ASSAF provides greater improvements for the higher QF evaluated (QF 95), which is the more challenging scenario. It is clear that ASSAF provides a breakthrough in J-UNIWARD steganography detection, providing a 6%–40% relative improvement over existing state-of-the-art methods in all of the use cases tested.

Fig. 10. Training and validation accuracy by epoch during the SNN training process: (a) QF 75, bpnzAC 0.4, and (b) QF 95, bpnzAC 0.4.

Table 4
Detection accuracy of ASSAF instances trained on one combination of QF and bpnzAC and tested on the others, presented as a heat map (red indicates lower and green higher detection rates); the rightmost column (AVG) shows each model's average.

Table 5
Comparison of the detection accuracy of existing methods and the relative improvement of ASSAF over the top performing method (at the four evaluated bpnzAC ratios and QF 75).

Method    bpnzAC 0.1      0.2           0.3            0.4
VNet      0.638           0.776         0.866          0.929
PNet      0.642           0.7874        0.877          0.934
Zeng-Net  0.54            0.64          0.74           0.8
J-XuNet   0.671           0.805         0.887          0.935
SRNet     0.679           0.811         0.884          0.933
ASSAF     0.912 (34.3%↑)  0.982 (21%↑)  0.992 (12.2%↑) 0.993 (6.4%↑)

Table 6
Comparison of the detection accuracy of existing methods and the relative improvement of ASSAF over the top performing method (at the four evaluated bpnzAC ratios and QF 95).

Method    bpnzAC 0.1    0.2          0.3            0.4
VNet      0.529         0.572        0.667          0.74
PNet      0.541         0.601        0.681          0.746
J-XuNet   0.544         0.614        0.693          0.763
SRNet     0.572         0.656        0.748          0.823
ASSAF     0.803 (40%↑)  0.85 (29%↑)  0.874 (16.8%↑) 0.962 (16.8%↑)

The graphs presented in Fig. 11 provide another comparison of ASSAF and the existing state-of-the-art methods for the two QFs evaluated. It can be clearly seen that ASSAF is superior to the existing methods in all eight use cases.
Fig. 11. Comparison of the detection accuracy of existing methods and ASSAF with QF 75 (a) and QF 95 (b) at different bpnzAC ratios.

Note that in our domain, especially in the era of deep learning methods, the BOSSBase dataset, containing 10K images, is sometimes insufficient for training a large neural network such as those used by current state-of-the-art methods. This situation has resulted in differences in the experimental configurations used in previous research. The Zeng-Net (Zeng et al., 2018) neural network architecture was trained differently on the BOSSBase dataset; in their experiment, the authors created a larger dataset of 40K images by slicing each of the 512X512 original images into four 256X256 images. Comparing ASSAF's performance to the results obtained in their work is slightly unfair, since more training data was used, and thus their performance should be better. Despite this, we can see that our model outperforms theirs. The SRNet experiments combined BOSSBase with the BOWS2 (Bas & Furon, n.d.) dataset in order to create a collection of 20,000 256X256 images; this means that they also resized the original 512X512 BOSSBase images into 256X256 images, which resulted in an easier testing scenario. ASSAF outperformed the SRNet architecture by a wide margin, although SRNet, like other state-of-the-art methods introduced in previous research, was trained on a larger dataset created by manipulating the original BOSSBase or by using additional datasets. Other papers, such as the paper on J-XuNet (Xu, 2017), did not explain explicitly how the BOSSBase dataset was divided in their evaluation, making a fair comparison difficult. We believe that our experimental design is rigorous and that our results can be compared to the others, as we only used the BOSSBase dataset without manipulating the data (e.g., cropping or resizing the original images), and we did not make any compromises when testing ASSAF. For example, decreasing the image size, as done in other research, reduces the complexity of the NN and provides less steganography space for embedding (note that this might not influence the results, as the embedding rate is relative). Furthermore, reducing the size of the image is a less realistic real-life scenario and might harm the properties of the image, as it performs a form of compression.

In addition to the significant improvement in detection accuracy presented above, in which ASSAF outperformed the state-of-the-art methods, the ASSAF architecture is smaller and less complex than the DNN architectures used in those methods. ASSAF has around 6.5M parameters, in contrast to the abovementioned methods, which have more than 10M parameters. In addition, our model has a total of 11 convolution layers (seven in the DAE and four in the SNN) and two fully connected layers, while the state-of-the-art methods possess more convolutional layers; SRNet has 12 convolution layers, and J-XuNet has 20 convolution layers, in addition to one or more fully connected layers. A reduced number of parameters simplifies the optimization problem the model is trying to solve; therefore, a model with fewer parameters requires less computational resources. Thus, our model is lighter and can be induced and trained faster; it only takes several hours, as opposed to the more complex methods, which take longer (perhaps several days) to train on adequate hardware that includes a powerful GPU.

7. Discussion and conclusions

In this paper, we presented ASSAF, a novel steganalysis architecture for the detection of J-UNIWARD steganography within JPEG images. The ASSAF architecture is composed of a combination of two neural network architectures: a denoising autoencoder and a Siamese neural network. To the best of our knowledge, we are the first to combine a DAE and SNN in one architecture in order to solve a classification task.
We evaluated ASSAF extensively on the BOSSBase dataset, which contains 10,000 grayscale images. The BOSSBase dataset is popular in steganography research and serves as a common baseline for studies in the steganography and steganalysis domains. We conducted our experiments and evaluation of ASSAF on eight use cases which combine four different payload sizes (bpnzAC ratios) and two JPEG image quality factors. Our results demonstrate the superiority of the ASSAF architecture over existing state-of-the-art methods for the detection of J-UNIWARD steganography. ASSAF's detection accuracy ranges from 0.803 (for the hardest use case: QF = 95 and bpnzAC = 0.1) to 0.993 (for the simplest use case: QF = 75 and bpnzAC = 0.4). ASSAF provides a relative improvement of 6%–40% over the top-performing existing state-of-the-art method in the accurate detection of J-UNIWARD steganography in a JPEG image. For instance, in comparison to SRNet (Boroumand et al., 2019), which achieved a detection accuracy of 0.572 on the BOSSBase dataset for QF = 95 and bpnzAC = 0.1, ASSAF achieved a detection accuracy of 0.803, an improvement of 40%. In addition, our architecture is much simpler than current state-of-the-art methods and thus is faster to train, less prone to overfitting, and easier to scale.

The issue of the resilience of steganalysis models to additive noise is a controversial topic in the steganalysis domain, as additive noise could be a form of spatial steganography, or it could be added in order to impair the model's performance. Some spatial steganography techniques (e.g., least significant bit (LSB) matching; Mielikainen, 2006) select the candidate embedding pixels randomly or pseudo-randomly, and the embedding is then performed by changing the pixels' values (a minimal code illustration of this appears at the end of this section). Steganography can also be modeled as additive noise within the cover image (Harmsen, 2003), and other studies (e.g., Singh & Shree, 2016) state explicitly that stego content in the JPEG domain constitutes additive noise. In this sense, the properties of several steganography techniques are similar to those of additive noise, namely changing pixel values in a random manner. Thus, additive noise can be considered a form of steganography (even if it does not contain a real payload), as the ''noise'' may contain real information. As a result, additive noise evaluations are not performed in steganalysis research, and to the best of our knowledge, such evaluations have not appeared in any other steganalysis papers. In addition, digital images naturally incorporate noise during image acquisition and processing (e.g., sensor noise and blur due to motion) (Zeng et al., 2018). It is worth mentioning that images with a higher noise level are better candidates for steganography and might already include secret embedded information, making this sort of discussion a controversial topic in the steganalysis domain. We believe that our evaluation process addresses this topic, since ''benign'' images with some level of noise were included in both the training and test phases, and our proposed solution provided low FPRs, demonstrating its ability to handle additive noise.

Based on the results achieved in this research, we conclude that ASSAF has great potential for improving the detection of adaptive steganography in JPEG images.
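As a minimal illustration of the additive-noise analogy discussed above, the NumPy sketch below implements naive spatial LSB matching (our own illustrative code, not taken from Mielikainen (2006) and unrelated to the J-UNIWARD setting evaluated in this paper): pixels are chosen pseudo-randomly, and a mismatching LSB is corrected by a random ±1 change, so the resulting stego signal is statistically close to sparse additive noise.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def lsb_matching_embed(cover, payload_bits):
    """Embed bits via +/-1 changes at pseudo-randomly chosen pixels.

    If a pixel's LSB already matches the payload bit, it is left
    untouched; otherwise the pixel is randomly incremented or
    decremented, which is why the change resembles additive noise.
    """
    stego = cover.astype(np.int16)  # widen to avoid uint8 wrap-around
    positions = rng.choice(cover.size, size=len(payload_bits), replace=False)
    for pos, bit in zip(positions, payload_bits):
        if (stego.flat[pos] & 1) != bit:
            step = rng.choice([-1, 1])
            # Clamp at the ends of the valid 8-bit range.
            if stego.flat[pos] == 0:
                step = 1
            elif stego.flat[pos] == 255:
                step = -1
            stego.flat[pos] += step
    return stego.astype(np.uint8)

cover = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
bits = rng.integers(0, 2, size=512)
stego = lsb_matching_embed(cover, bits)
diff = stego.astype(np.int16) - cover.astype(np.int16)
print(np.unique(diff))  # only values in {-1, 0, 1}: sparse +/-1 ''noise''
```

J-UNIWARD itself is adaptive and operates on DCT coefficients, so this spatial sketch is meant only to motivate why additive-noise robustness and steganalysis are so closely intertwined.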
7.1. Limitations

ASSAF was trained and tested only on the BOSSBase dataset, which contains grayscale images at a fixed size of 512×512. Thus, the results of this study do not indicate how the proposed method would perform on color images or on images of different sizes. The analysis of color images requires a larger architecture that processes the three image color channels: red, green, and blue (RGB). In addition, in order to cope with a larger image size, the proposed architecture must adapt the input image by cropping or resizing the original image to fit the architecture's input size, which might impact the steganography ''signature'' in the image. Such changes increase the model's complexity and would therefore require considerably more computational resources for training. These limitations should be explored further in future research.

7.2. Future work

In future research, we plan to employ the ASSAF architecture for the detection of other forms of steganography and different image formats. We also plan to investigate the use of a DAE for removing steganography from images. Another possible research direction focuses on generalizing the architecture to support the detection of several payload sizes or various steganography techniques using a single model. As mentioned earlier, we also propose investigating the influence of various image sizes and color channels.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix

See Fig. 12.

Fig. 12. Detailed architecture of the SNN.

References

Ahmed, N., Natarajan, T., & Rao, K. R. (1974). Discrete cosine transform. IEEE Transactions on Computers, C-23(1), 90–93. http://dx.doi.org/10.1109/T-C.1974.223784.
Akhtar, N., Johri, P., & Khan, S. (2013). Enhancing the security and quality of LSB based image steganography. In Proceedings - 5th international conference on computational intelligence and communication networks (pp. 385–390). http://dx.doi.org/10.1109/CICN.2013.85.
Bas, P., Filler, T., & Pevný, T. (2011). Break our steganographic system: The ins and outs of organizing BOSS (pp. 59–70). Berlin, Heidelberg: Springer. http://dx.doi.org/10.1007/978-3-642-24178-9_5.
Bas, P., & Furon, T. (n.d.). BOWS-2. Retrieved from http://bows2.ec-lille.fr.
Boroumand, M., Chen, M., & Fridrich, J. (2019). Deep residual network for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 14(5), 1181–1193. http://dx.doi.org/10.1109/TIFS.2018.2871749.
Cheddad, A., Condell, J., Curran, K., & Mc Kevitt, P. (2010). Digital image steganography: Survey and analysis of current methods. Signal Processing, 90(3), 727–752. http://dx.doi.org/10.1016/j.sigpro.2009.08.010.
Chen, M., Sedighi, V., Boroumand, M., & Fridrich, J. (2017). JPEG-phase-aware convolutional neural network for steganalysis of JPEG images. In Proceedings of the 5th ACM workshop on information hiding and multimedia security (pp. 75–84). http://dx.doi.org/10.1145/3082031.3083248.
Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proceedings - 2005 IEEE computer society conference on computer vision and pattern recognition. http://dx.doi.org/10.1109/CVPR.2005.202.
Coatrieux, G., Pan, W., Cuppens-Boulahia, N., Cuppens, F., & Roux, C. (2013). Reversible watermarking based on invariant image classification and dynamic histogram shifting. IEEE Transactions on Information Forensics and Security, 8(1), 111–120. http://dx.doi.org/10.1109/TIFS.2012.2224108.
Harmsen, J. J. (2003). Steganalysis of additive noise modelable information hiding. In Proc. SPIE electronic imaging.
Holub, V., & Fridrich, J. (2012). Designing steganographic distortion using directional filters. In WIFS 2012 - Proceedings of the 2012 IEEE international workshop on information forensics and security (pp. 234–239). http://dx.doi.org/10.1109/WIFS.2012.6412655.
Holub, V., & Fridrich, J. (2015). Low-complexity features for JPEG steganalysis using undecimated DCT. IEEE Transactions on Information Forensics and Security, 10(2), 219–228. http://dx.doi.org/10.1109/TIFS.2014.2364918.
Holub, V., Fridrich, J., & Denemark, T. (2014). Universal distortion function for steganography in an arbitrary domain. EURASIP Journal on Information Security. http://dx.doi.org/10.1186/1687-417X-2014-1.
Huang, F., Zhong, Y., & Huang, J. (2014). Improved algorithm of edge adaptive image steganography based on LSB matching revisited algorithm. In Lecture notes in computer science: vol. 8389 (pp. 19–31). http://dx.doi.org/10.1007/978-3-662-43886-2_2.
Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop.
Mielikainen, J. (2006). LSB matching revisited. IEEE Signal Processing Letters, 13(5), 285–287. http://dx.doi.org/10.1109/LSP.2006.870357.
Pevný, T., Filler, T., & Bas, P. (2010). Using high-dimensional image models to perform highly undetectable steganography. In Lecture notes in computer science: vol. 6387 (pp. 161–177). http://dx.doi.org/10.1007/978-3-642-16435-4_13.
Qin, C., Chang, C. C., Huang, Y. H., & Liao, L. T. (2013). An inpainting-assisted reversible steganographic scheme using a histogram shifting mechanism. IEEE Transactions on Circuits and Systems for Video Technology, 23(7), 1109–1118. http://dx.doi.org/10.1109/TCSVT.2012.2224052.
Sharp, T. (2001). An implementation of key-based digital signal steganography. In Lecture notes in computer science: vol. 2137 (pp. 13–26). Berlin, Heidelberg: Springer. http://dx.doi.org/10.1007/3-540-45496-9_2.
Singh, P., & Shree, R. (2016). A comparative study to noise models and image restoration techniques. International Journal of Computer Applications. http://dx.doi.org/10.5120/ijca2016911336.
Solanki, K., Sarkar, A., & Manjunath, B. S. (2007). YASS: Yet another steganographic scheme that resists blind steganalysis. In Lecture notes in computer science: vol. 4567 (pp. 16–31). http://dx.doi.org/10.1007/978-3-540-77370-2_2.
Tan, S., & Li, B. (2014). Stacked convolutional auto-encoders for steganalysis of digital images. In 2014 Asia-Pacific signal and information processing association annual summit and conference. http://dx.doi.org/10.1109/APSIPA.2014.7041565.
Taubman, D. S. (2002). JPEG2000: Image compression fundamentals, standards and practice. Journal of Electronic Imaging, 11(2), 286. http://dx.doi.org/10.1117/1.1469618.
Tsai, P., Hu, Y. C., & Yeh, H. L. (2009). Reversible image hiding scheme using predictive coding and histogram shifting. Signal Processing, 89(6), 1129–1143. http://dx.doi.org/10.1016/j.sigpro.2008.12.017.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning (pp. 1096–1103). http://dx.doi.org/10.1145/1390156.1390294.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research (JMLR), 11, 3371–3408.
Wei, W. (2018). New malware takes commands from memes posted on Twitter. Retrieved April 3, 2019, from https://thehackernews.com/2018/12/malware-twitter-meme.html.
Xu, G. (2017). Deep convolutional neural network to detect J-UNIWARD. In IH and MMSec 2017 - Proceedings of the 2017 ACM workshop on information hiding and multimedia security (pp. 67–73). Association for Computing Machinery (ACM). http://dx.doi.org/10.1145/3082031.3083236.
Zeng, J., Tan, S., Li, B., & Huang, J. (2018). Large-scale JPEG image steganalysis using hybrid deep-learning framework. IEEE Transactions on Information Forensics and Security, 13(5), 1200–1214. http://dx.doi.org/10.1109/TIFS.2017.2779446.
Zurada, J. M. (1992). Introduction to artificial neural systems. http://dx.doi.org/10.1016/0925-2312(92)90018-k.