Digital Image Forgery Detection Using Deep Autoencoder and CNN Features
  • Sumaira Bibi1, Almas Abbasi1, Ijaz Ul Haq2, Sung Wook Baik2,*, and Amin Ullah3,*

Human-centric Computing and Information Sciences volume 11, Article number: 32 (2021)


In this digital era, image forgery is very common, with copy-move and splicing being the most popular types. Most existing techniques in the literature detect these types of forgery; however, most researchers have targeted only JPEG-compressed images, although digital forensic techniques should not be specific to any image format. In deep learning, the convolutional neural network (CNN) and the autoencoder are very popular methods for extracting complex visual features from digital images. In the proposed method, stacked autoencoders (SAE) with multiple structures are introduced for forgery detection across various image compression techniques, where the pre-trained AlexNet and VGG16 are utilized for image feature extraction. The Ensemble Subspace Discriminant classifier is utilized to classify images as authentic or forged. We performed an extensive ablation study on two CASIA datasets, where the configuration with two autoencoders and AlexNet features outperformed all other architectures and state-of-the-art methods, achieving 95.9% accuracy for JPEG images and 93.3% for TIFF images.


Artificial Intelligence, Deep Learning, Machine Learning, Image Processing, Forgery Detection Applications, Convolutional Neural Network, Autoencoder and Stacked Autoencoder


Photography is an art, and a photograph or image is a depiction of the exterior form of a person, a thing, or anything in nature. Technological innovations have shifted photography from its original basis to the digital image. Now, in the digital era, an image is a part of the actual world captured through a series of image creation steps. Since the early days of photographic images, security and authenticity have been the main problems. In the age of digital images, altering an image has become very common. Due to the availability of powerful editing software, anyone can manipulate images very easily. The process of altering or changing digital images and documents to mislead someone is called forgery [1]. With the advancement of image processing technology, forged images and videos have become so realistic that they cannot be detected by the human eye. For instance, DeepFake [2] has recently been introduced, which can make a synthetic video of any face; such a forgery can be used in crime scenes by forging surveillance camera footage. Forgery detection can play an important role in stopping DeepFake-style synthetic changes. According to Zhang et al. [3], in the digital age there is a large volume of fake images on social media platforms and everywhere on the World Wide Web. Digital forgeries have existed for several years and are still considered an interesting topic for researchers. Forged images can be distributed very easily and can be used to deceive audiences, and the outcomes may be very serious. Therefore, verifying the validity of digital images is urgently required. For instance, Kashyap et al. [4] stated that, in crime scene investigation and many other fields, there are several ways of changing a picture, of which the most common are splicing and copy-move.
According to Zhou et al. [5], image forgery is usually considered a method of cropping and pasting regions from the same or different sources. This problem is categorized into two groups: copy-move forgery and splicing forgery. Copying an area from one portion of an image and then pasting it onto another area of the same image is called copy-move forgery. This kind of forgery is very popular. On the other hand, if the pasted area comes from an image other than the one it is pasted into, the process is known as image splicing. Splicing is a procedure of merging two or more images and is also called image composition.
Recently, many researchers have started to work on the issue of forged digital images. Numerous procedures have been established to detect alteration and forgery and to confirm the validity of images. As stated by Warif et al. [6], the authenticity and consistency (reliability) of digital images are important because digital images are so easy to alter. Wang et al. [7] described that this has become a danger to security because anybody can obtain and alter image contents without leaving any noticeable clues. The authenticity of digital images is therefore a very severe problem that needs to be solved. JPEG and TIFF are very commonly used digital image file formats. According to Zhang et al. [3], most work on the analysis of tampered and forged images focuses only on JPEG-compressed images. Many approaches are specific to the JPEG file format [8–14], where they detect the tampered area based on the artifacts introduced by JPEG compression. However, digital forensic detection techniques must not be specific to any image format. They must be able to detect tampered regions in other formats as well, for example in both JPEG and TIFF images. Many other solutions identify features based on a specific tampering procedure such as copy-move. Therefore, an accurate and robust method to detect forged images in these formats is urgently needed. The novelty of the proposed approach is that it addresses time complexity and can effectively detect both copy-move and splicing forgeries in differently compressed images. Feature extraction at the backend using a convolutional neural network (CNN) reduces time, and these features, after training with stacked autoencoders (SAE), are used to detect forgeries across forgery types and compression formats. We have addressed these challenges with the following key contributions:
1) We propose an effective method for three kinds of compression techniques with high recognition accuracy. The proposed method uses the activations of the fully connected layers of the pre-trained VGG16 and AlexNet CNN models. The features of the fully connected layers are global, high-level representations of the image. We investigated the features of the two CNN models for forged image representation and showed via experiments which model can learn small altered patterns in the image.
2) The features extracted from a fully connected layer of the CNN models are of very high dimension, which makes binary classification (forged or not forged) ineffective. Therefore, to reduce the high-dimensional space to low-dimensional features, we train SAEs of multiple structures on these features to squeeze them and make them suitable for classification into two classes, forged and authentic. SAEs with two and three stacked autoencoders, respectively, are used to reduce the dimensions of the features.
3) The trained features from the SAE are propagated forward to the Ensemble Subspace Discriminant (ESD) classifier to separate forged images from authentic ones.
The rest of the paper is organized as follows. The literature review is discussed in Section 2. The proposed methodology is given in Section 3. The experimental evaluation, testing, and a discussion of the results are given in Section 4. Section 5 concludes this article and recommends directions for future research.

Related Work

This section summarizes earlier research on common forgery detection techniques in images, including copy-move and splicing forgeries, in the context of different image compression formats.

Statistical Machine Learning-Based Techniques
Existing methods for copy-move forgery detection are divided into two major categories: block-based and keypoint-based systems. For instance, Wang et al. [7] presented a passive image copy-move tampering detection method using local keypoint features. The drawback of this method is its high computational complexity, so it cannot be applied in real-time systems. In another work, Mahmoud and Al-Rukab [15] presented a brief introduction to moments and illustrated detection methods that are based on moments. The most important property of these methods is robustness to all types of attacks; most of them succeeded in some cases, such as blurring, and failed under rotation. Wang et al. [16] presented a copy-move tampering detection method using quaternion exponent moments (QEMs). Their results show effectiveness only on copy-paste forgeries under conditions such as scaling and rotation. Warif et al. [6] surveyed the latest techniques in copy-move forgery detection. They described the common copy-move forgery detection workflow using block-based and keypoint-based methods, discussed the datasets and validation methods used in the literature, and summarized several directions for future work. Kuznetsov and Myasnikov [17] proposed a hashing-based technique for copy-move detection that can detect transformed duplicate areas thanks to a special preprocessing technique. Further research involves developing new processing techniques to find more complicated forms of alteration. Zhao et al. [18] proposed a technique built on image segmentation and a swarm intelligence (SI) algorithm. In this technique, they separated an image into small non-overlapping blocks. The SI algorithm is used to find the best possible parameters at every layer. These parameters are then applied in a scale-invariant feature transform (SIFT) based system to identify every layer. Their approach produced a high false-positive rate. 
This issue should be resolved to enhance the effectiveness of their technique. Niu et al. [19] proposed a fast and accurate copy-move forgery detection algorithm based on complex-valued invariant features. Jalab et al. [20] proposed a method in which they applied the discrete wavelet transform (DWT) to image blocks and, for feature extraction, used a new fractional texture descriptor based on approximated Machado fractional entropy. They used a support vector machine (SVM) classifier for spliced forgery classification only. Huang et al. [21] presented a keypoint-based image forgery approach that uses a superpixel segmentation algorithm and the Helmert transformation to detect copy-move forgery. First, they extract keypoints and descriptors by applying the SIFT algorithm. Matched pairs are then obtained by computing the similarity between keypoint descriptors. Based on spatial distance and geometric constraints derived through the Helmert transformation, they group these matched pairs to get rough forged regions. They then refine these rough regions and remove any faults, so that the forged areas can be localized more precisely.

Deep Learning-Based Techniques
Deep learning-based techniques have outperformed state-of-the-art machine learning-based methods in many computer vision applications. In digital image forgery detection, Bunk et al. [22] proposed two approaches for image forensics. In the first, they presented an end-to-end system for detection and performed localization in digitally manipulated images using the Radon transform and deep learning methods. In the second, they applied a combination of resampling features, which are based on maps, and used an LSTM-based model to classify tampered areas. Zhang et al. [3] presented a two-phase deep learning method for feature learning. In the first phase, they used an SAE through which the model learns features for every image patch. In the second phase, they additionally integrated the contextual information of the full area of the image. Similarly, Zhou et al. [5] proposed a block-based technique in which the processing unit of every block is a powerful CNN. This method is capable of detecting splicing forgery, but their experimental results show its strength only for JPEG compression. Kim and Lee [23] presented an image manipulation detection algorithm using a deep CNN model. This neural network is composed of four parts: one high-pass filter, two convolutional layers, two fully connected layers, and one output layer. For the experiments, they modified images using median filtering, Gaussian blur, and additive white Gaussian noise, and resized the images to 256×256. Ink mismatch detection is an important step in image forgery detection. For instance, Khan et al. [24] presented a deep learning method for ink mismatch detection in hyperspectral document images. They extracted the spectral responses of ink pixels from a hyperspectral document image, reshaped them into a CNN-friendly image format, and passed them to the CNN for classification. 
Their proposed technique successfully identifies various ink types in hyperspectral document images for forgery detection. In another method, Liu et al. [25] presented an efficient copy-move tampering recognition technique based on the convolutional kernel network (CKN), which is a series of matrix computations and convolutional operations. Proper pre- and post-processing for the CKN is then implemented to achieve copy-move forgery detection. Cozzolino et al. [26] introduced a learning-based forensic transfer (FT) method built on an autoencoder to distinguish between real and fake images. They trained the autoencoder-based approach on a source domain and separated the real and fake images in the hidden space. For adaptation to the target domain, they used a few target training samples and the hidden space to predict the class. Khalid and Woo [27] proposed a one-class classification model based on a variational autoencoder to detect fake human face images. They proposed two different approaches. In the first approach, they computed the reconstruction score directly from the input and output images using the same encoder and decoder building blocks. In the second approach, they added an additional encoder block after the decoder, which takes the decoder output as input, and computed the reconstruction score between the outputs of the first and second encoders. Marra et al. [28] proposed a CNN-based image forgery detection and localization method that makes decisions from the whole image without resizing it. Their framework consists of three blocks: patch-wise feature extraction, image-wise feature combination with multiple pooling strategies, and global classification. 
Meena and Tyagi [29] presented a survey paper and discussed all types of image forgery detection techniques, i.e., image splicing detection, copy-move forgery detection, image resampling detection, and image retouching detection. Walia and Kumar [30] discussed different types of image forgeries, methods of image forgery detection and localization, the novelty and shortcomings of these methods, and major challenges. Li et al. [31] proposed a method called Face X-ray for detecting forgery in face images and revealing the manipulated boundaries in forged images. They used the HRNet CNN in their framework. Dixit and Bag [32] proposed a complex framework that utilized keypoint matching using K-nearest neighbor search via a K-d tree to detect forged images. Abbas et al. [33] performed experiments using two deep learning models, a smaller VGGNet and MobileNetV2, for copy-move image forgery. In another survey, Saber et al. [34] discussed all types of forgeries, i.e., digital watermarking, digital signature, image splicing, image retouching, and copy-move forgery. They discussed deep learning and CNN-based techniques. Yang et al. [35] proposed a keypoint matching-based method for copy-move forgery detection. In the current literature, most researchers have addressed two major problems: low accuracy and time complexity. The majority of the methods are limited to one kind of forgery detection, and some fail in certain cases, for example when the forged part is rotated or scaled. Several systems cope with rotation up to a certain level, but they cannot handle all types of rotation, which needs to be resolved in the future.

Proposed Methodology

In this section, we discuss the implementation procedure of the proposed framework for digital image forgery detection. The proposed approach addresses time complexity and can effectively detect both copy-move and splicing forgeries in differently compressed images, which is the main issue in image forgery detection methods. If the images are input directly to the SAE, it takes many hours to train and extract features; that is why we first extract features from the images using a pre-trained CNN and then pass these extracted features to the SAE as input to train for forgery detection. Firstly, high-level features are extracted from the images using the pre-trained AlexNet and VGG16 CNN models; these extracted features are then passed to the SAE for training. Multiple experiments are performed with different CNN features and SAE structures. Fig. 1 shows the proposed framework for digital image forgery detection.

Fig. 1. The proposed framework for digital image forgery detection. Firstly, features are extracted from images using a CNN model; these extracted features are then passed to the SAE, which squeezes the high-dimensional features, followed by forgery detection.

Feeding raw image data directly into the autoencoder does not detect forgeries properly and takes more time. Many open-source pre-trained models are now available that have already been trained on huge datasets and can extract features effectively, so it is reasonable to utilize these models for our work. Pre-trained CNN models, including AlexNet and VGG16, are simple object classification models, both pre-trained on the large ImageNet dataset. In the proposed method, after the data are loaded, features are extracted from the images using these models and then passed to the proposed SAE for forgery detection. The SAE model automatically trains on the extracted features and gives us trained features for classification. Finally, the ESD learner is used to classify images as forged or authentic. CNNs have shown their ability in a variety of computer vision tasks, including image classification, image retrieval, object detection, localization, and image segmentation. The effective usage of CNNs for various tasks has motivated researchers to use them for feature extraction. Training a deep CNN architecture needs a massive quantity of data and expensive computational resources. Therefore, researchers have proposed using pre-trained CNN models, whose parameters are trained on huge datasets. The features extracted from these models are highly discriminative and able to represent visual contents, which can help the model to learn complex features.

AlexNet CNN Model
AlexNet is one of the most popular pre-trained architectures in the era of deep learning. It has 60 million parameters and eight learned layers: five convolutional layers and three fully connected layers, the last of which feeds a softmax. Each convolutional layer applies filters followed by the nonlinear ReLU activation function, and local response normalization follows the first two convolutional layers. Of the five convolutional layers, the first, second, and fifth are followed by max-pooling layers. The input size is fixed to 227×227 because the fully connected layers require a fixed input size. The first convolutional layer has 11×11 filters (kernels) with a stride of four, and its output is 96 feature maps of size 55×55. Then a max-pooling layer with 3×3 kernels is applied. The second convolutional layer has 5×5 filters with a stride of one, followed by max pooling; its output is 256 feature maps of size 27×27. The third, fourth, and fifth layers have 3×3 filters with a stride of one. The output of the third and fourth layers is 384 feature maps of size 13×13, and the output of the fifth layer is 256 feature maps of size 13×13. The fifth layer is followed by a max-pooling layer. The first two fully connected layers have 4096 outputs each and are used with dropout, which is applied to reduce overfitting; the last fully connected layer has 1000 neurons and is followed by the softmax. When activation, normalization, pooling, and dropout operations are counted as separate layers, the AlexNet model contains 25 layers. In AlexNet, every convolutional layer extracts local features, while the last three fully connected layers extract global features. Global features are a compact representation of every area of the image. The second-last and third-last fully connected layers extract 4096 features each. The last fully connected layer is a compact representation of all layers; it extracts 1000 features, and these are the main features used for further processing. 
The 4096 features are high-dimensional; they are reduced to 1000 features in the last fully connected layer. Processing the 4096 deep features has a high computational cost, whereas the 1000 features are suitable for this work in terms of computational complexity, and the SAE can encode them easily for forgery detection. These 1000 extracted features are therefore passed to the SAE network for training, and the trained SAE features are then used for classification.
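The feature-map sizes quoted above can be checked with the usual output-size formula for convolution and pooling, floor((n + 2p − k) / s) + 1. The following is a minimal sketch rather than the authors' implementation: the kernel sizes, the conv1 stride, and the 227×227 input come from the text, while the padding values and the pooling stride of two are assumptions taken from the standard AlexNet configuration.

```python
def out_size(size: int, kernel: int, stride: int, pad: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad). Pad values and the pooling stride of 2 are
# assumptions (standard AlexNet); they are not stated in the text.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

def trace_sizes(input_size: int = 227) -> dict:
    """Return the spatial size after each layer, keyed by layer name."""
    sizes = {}
    size = input_size
    for name, kernel, stride, pad in ALEXNET_LAYERS:
        size = out_size(size, kernel, stride, pad)
        sizes[name] = size
    return sizes

sizes = trace_sizes()
print(sizes["conv1"], sizes["conv2"], sizes["conv5"])  # 55 27 13
```

Under these assumptions the trace reproduces the 55×55, 27×27, and 13×13 maps described in the text.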

VGG16 CNN Model
Training a deep CNN model requires a large quantity of data and expensive computational resources, and pre-trained CNN models are a good solution to this problem. In the proposed technique, the pre-trained CNN model VGG16 is selected for extracting deep features from images. It is observed in tests that it can achieve good accuracy and time efficiency. The architecture of VGG16 is shown in Table 1. It differs from previous CNN models, which have 11×11 or 7×7 kernels with strides of four or five. Such large kernels raise the number of parameters and make the model more complex, and a network with wide strides can miss important patterns in the image. VGG16 uses 3×3 kernels for every convolutional layer. This helps to decrease the number of parameters, and with a stride of one pixel in every convolutional layer, every pixel of the image is convolved. The input to the network is an RGB image of fixed size 224×224×3, which passes through the convolutional layers with 3×3 receptive fields. The VGG network has a total of 16 layers: 13 stacked convolutional layers followed by three fully connected layers. The 13 convolutional layers are arranged in five stacks: the first two stacks have two convolutional layers each and the remaining three stacks have three convolutional layers each, and every stack is followed by one pooling layer. The output of these five stacks then goes to the three fully connected layers, of which the first two contain 4096 neurons each and the final layer has 1000 neurons.

Table 1. The overall structure of VGG16 CNN model
Layer             Kernel size           Stride, pad   Channels
Conv1a, Conv1b    3×3                   1, 1          64
Max pool          -                     -             -
Conv2a, Conv2b    3×3                   1, 1          128
Max pool          -                     -             -
Conv3a, Conv3b    3×3                   1, 1          256
Conv3c            3×3                   1, 1          256
Max pool          -                     -             -
Conv4a, Conv4b    3×3                   1, 1          512
Conv4c            1×1                   1, 1          512
Max pool          -                     -             -
Conv5a, Conv5b    3×3                   1, 1          512
Conv5c            1×1                   1, 1          512
Max pool          -                     -             -
FC6               Inner product 4096    -             -
FC7               Inner product 4096    -             -
FC8               Inner product 1000    -             -
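As a quick check of the structure described above, the following sketch counts the weight layers and traces the spatial size through the five stacks. It is illustrative only: the stack layout and channel counts come from the text and Table 1, while the 2×2 pooling with a stride of two (which halves the spatial size) and the padding of one that preserves the size in each 3×3 convolution are assumptions based on the standard VGG16 configuration.

```python
# (number of convolutional layers, output channels) for each of the five stacks
VGG16_STACKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_LAYERS = [4096, 4096, 1000]

def count_weight_layers() -> int:
    """Convolutional plus fully connected layers."""
    return sum(n_conv for n_conv, _ in VGG16_STACKS) + len(FC_LAYERS)

def final_feature_map(input_size: int = 224) -> tuple:
    """Shape (height, width, channels) after the last pooling layer."""
    size = input_size
    for _ in VGG16_STACKS:
        # 3x3 convolutions with stride 1 and pad 1 keep the size;
        # the assumed 2x2 pooling with stride 2 halves it.
        size //= 2
    return size, size, VGG16_STACKS[-1][1]

def weights_per_channel_pair(kernel: int, n_layers: int = 1) -> int:
    """Kernel weights per input/output channel pair for stacked square kernels."""
    return n_layers * kernel * kernel

print(count_weight_layers())             # 16
print(final_feature_map())               # (7, 7, 512)
print(weights_per_channel_pair(3, 3))    # 27, three stacked 3x3 layers
print(weights_per_channel_pair(7))       # 49, one 7x7 layer
```

The last two lines illustrate the parameter argument made in the text: three stacked 3×3 layers need fewer kernel weights per channel pair than a single 7×7 layer.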

Training using SAE
A deep autoencoder is an artificial neural network framework for unsupervised feature learning. It learns efficient data encodings in an unsupervised manner and is mostly used for automatic feature extraction. The autoencoder has three layers: an input layer, a hidden encoding layer, and an output decoding layer. The autoencoder first reduces the input dimension, transforming the input data into a compressed representation. The next layer then uses this compressed form of the data to reconstruct the input with minimum error [36]. The output of the encoder section is taken as a high-level set of features for a classification task. In this paper, an SAE is presented to get the best performance. An SAE is built by stacking many layers of basic autoencoders, where the output of every layer is fed as input to the next layer. The first layer of the SAE model learns first-order features, which the deeper layers combine to extract more complex features. In terms of forgery detection for JPEG images, most methods detect tampered areas from the artifacts left by multiple JPEG compression [8–14]. However, TIFF introduces no compression artifacts, which makes detection very challenging, yet the proposed SAE effectively detects forgery in TIFF images. Our experiments show that it can detect forgery in both JPEG and TIFF images, and that it detects both copy-move and splicing forgery. In the first experiment, two autoencoders are stacked, and in the second experiment, three autoencoders are stacked. The motive behind this neural network is that it learns the features automatically from the data.
The pre-trained models VGG16 and AlexNet are designed for general object image classification. Their parameters are trained on a large-scale dataset, which can provide a better representation for any visual data. Furthermore, deep CNN models need millions of images and high-performance GPUs for training. That is why feature extraction from pre-trained CNN models is the preferred choice. In the proposed framework, these extracted features are passed to the SAE for image forgery detection. Generally, an autoencoder consists of two parts: an encoder and a decoder. In encoding, the model learns to reduce the input dimensions and compress the data into an encoded representation. In decoding, the model learns to reconstruct the original features from the encoded representation so that the reconstruction error can be calculated. In encoding, the data are multiplied by the weights “W”, the biases “b” are added, and the result is passed through a nonlinear function such as the sigmoid, as given in Equation (1). The encoded data are then decoded back to the dimensions of the input, as shown in Equation (2). The weights are adjusted using backpropagation to bring the mean squared error close to zero. In Equations (1) and (2), “x” is the data, “W” indicates the weight matrix, and “b” indicates the bias vector. The weights and biases are initialized randomly and then modified iteratively through backpropagation during training.

$h(x) = \mathrm{sigm}(Wx + b)$ (1)

$\widehat{x} = \mathrm{sigm}(W(h(x)) + b)$ (2)
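Equations (1) and (2) can be illustrated with a small forward pass. This is a minimal sketch using only the Python standard library, not the authors' implementation. It assumes separate weight matrices and bias vectors for the encoder and decoder (the equations reuse the symbols W and b for both), and the 8-to-4 layer sizes and the initialization range are hypothetical values chosen to keep the example small.

```python
import math
import random

def sigm(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def affine_sigm(W, b, x):
    """Compute sigm(W x + b) for a weight matrix W given as a list of rows."""
    return [sigm(sum(w * v for w, v in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def init_layer(n_out: int, n_in: int, rng: random.Random):
    """Random initialisation of weights and biases (range is an assumption)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return W, b

def autoencoder_forward(x, encoder, decoder):
    """Return the hidden code h(x) and the reconstruction x_hat."""
    h = affine_sigm(encoder[0], encoder[1], x)      # Equation (1)
    x_hat = affine_sigm(decoder[0], decoder[1], h)  # Equation (2)
    return h, x_hat

def mse(x, x_hat) -> float:
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

rng = random.Random(0)
x = [rng.random() for _ in range(8)]   # toy input vector
encoder = init_layer(4, 8, rng)        # hypothetical sizes: 8 -> 4
decoder = init_layer(8, 4, rng)        # 4 -> 8
h, x_hat = autoencoder_forward(x, encoder, decoder)
print(len(h), len(x_hat))              # 4 8
```

Training would then adjust the weights by backpropagation to reduce `mse(x, x_hat)`; that step is omitted here.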

In the SAE, the first hidden layer takes the input “x”, while each of the other layers takes its input from the preceding hidden layer. The SAE is given mathematically in Equations (3) and (4). In these equations, $n$ is the number of encoding layers, and $x^l$, $W^l$, and $b^l$ are the data, weight matrix, and bias vector of layer $l$.

$h(x)^{l+1} = \mathrm{sigm}(W^l x^l + b^l)$ (3)

$\widehat{x}^{n+l+1} = \mathrm{sigm}(W^{n-l} h(x)^{n+l} + b^{n-l})$ (4)

The SAE trains each layer of the network separately; while training one layer, it freezes the weights of the other layers. In this paper, two experiments are performed. In the first experiment, two autoencoders are stacked, and in the second experiment, three autoencoders are stacked on top of each other.

Classification of Authentic and Forged Images
The proposed framework is tested with various classifiers, and we achieved the best accuracy with the ensemble of subspace discriminant classifiers of Ashour et al. [37]. Multiple-classifier or ensemble-based techniques are usually more desirable than their single-classifier counterparts because they reduce the chance of a poor selection. The ensemble classifier combines a set of classifiers that together may produce better classification performance than an individual classifier. The trained features from the SAE are classified by the ESD learner. The subspace discriminant method uses an ensemble of 30 weak learners, with discriminant learners as the base classifier. The ESD learner takes the predictions from the weak discriminant learners and merges them into a strong predictor, which leads to better classification outcomes. This method uses cross-validation to evaluate the model's prediction performance, which helps to guard against overfitting. Each round of cross-validation involves randomly partitioning the data into a training set and a testing set. To make the ESD classifier more reliable, 5-fold cross-validation is used. In every fold, 20% of the data is used for testing and 80% for training. In 5-fold cross-validation, the model is trained and tested five times, each time with different data for training and testing. The estimated error is calculated for all folds and averaged to provide the final error estimate of the model.
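The 5-fold protocol described above can be sketched as follows. This is an illustrative outline using only the Python standard library, not the authors' implementation; the `evaluate` callback, which would train the classifier on the training indices and return its error on the test indices, is a hypothetical placeholder, and the fixed random seed is an assumption.

```python
import random

def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random partitioning
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validated_error(n_samples: int, evaluate, k: int = 5, seed: int = 0) -> float:
    """Average the per-fold error returned by the hypothetical `evaluate` callback."""
    errors = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(errors) / len(errors)

# With 1,000 samples, every fold uses 800 samples for training and 200 for testing.
for train, test in k_fold_indices(1000):
    print(len(train), len(test))  # 800 200
```

Each sample appears in exactly one test fold, so every image is used for testing once and for training four times.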

Experimental Results and Discussion

In this section, we discuss the experiments and the results achieved using 1,000 images randomly selected from the CASIA database, of which 400 images are forged and 600 images are authentic. These images are in JPEG and TIFF formats and contain copy-move and splicing forgeries. As described above, the trained SAE features are classified by the ESD learner, which merges the predictions of an ensemble of 30 weak discriminant learners into a strong predictor. Accuracy, precision, and fallout are used as the performance evaluation measures and for comparison with the existing approaches. Accuracy is the ratio of correctly predicted observations to all observations, and precision is the ratio of correctly predicted positive observations to all observations predicted as positive. To measure the performance of the experiments, the proposed model is evaluated with SAEs of different depths. The first experiment uses two autoencoders with multiple hidden layers, and the second experiment uses three autoencoders with multiple hidden layers. The autoencoders in both experiments are trained for up to 1,500 epochs, and the experiments show that the loss curves of the autoencoders move smoothly towards zero without overfitting. The following definitions are used during the evaluation.




True positives (TP): If the image is authentic and the model predicts it as authentic.
True negatives (TN): If the image is forged and the model predicts it as forged.
False positives (FP): If the image is forged, but the model predicts it as authentic.
False negatives (FN): If the image is authentic, but the model predicts it as forged.
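Given these four counts, the evaluation measures can be computed as in the sketch below. Accuracy and precision follow their usual definitions; fallout is not defined in the text, so the standard false-positive rate FP / (FP + TN) is assumed here. The counts in the example are hypothetical and chosen only to match the 600 authentic and 400 forged images; they are not results from the paper.

```python
def safe_div(num: float, den: float) -> float:
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correct predictions over all predictions."""
    return safe_div(tp + tn, tp + tn + fp + fn)

def precision(tp: int, fp: int) -> float:
    """Correctly predicted positives over all predicted positives."""
    return safe_div(tp, tp + fp)

def fallout(fp: int, tn: int) -> float:
    """False-positive rate (assumed definition): FP / (FP + TN)."""
    return safe_div(fp, fp + tn)

# Hypothetical counts for 600 authentic (positive) and 400 forged (negative) images.
tp, fn = 570, 30   # authentic images predicted as authentic / as forged
tn, fp = 380, 20   # forged images predicted as forged / as authentic

print(accuracy(tp, tn, fp, fn))  # 0.95
print(fallout(fp, tn))           # 0.05
```

Note that with the definitions above the positive class is "authentic", so a false positive is a forged image that passes as authentic.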

CASIA version 1.0 and 2.0 datasets are used for evaluation of the proposed method. They are very common and complex datasets for forgery detection, which contain copy-move and splicing forged images. The CASIA dataset version 1.0 contains 1,721 color images in JPEG format, among which 800 images are authentic, and 921 images are forged. These images are of size 384×256 pixels. The forgery part includes resizing and rotation. The CASIA version 2.0 dataset is tougher, which includes 7,491 authentic images and 5,123 forged images. The size of these images varies from 240×160 to 900×600. Their image formats are JPEG, BMP, and TIFF.

Experiments with Two Layers SAE
The structure of the two layers SAE is shown in Fig. 2. It takes 1,000-dimensional input features, which are processed by the 600 hidden units of the first encoder. The 600-dimensional features extracted by the first autoencoder are then fed to the second autoencoder. The second encoder outputs a 300-dimensional feature vector, which is then used for classification.
Fig. 2. Two layers stacked autoencoder architecture used for squeezing high-dimensional feature to low dimensions
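The dimensionality reduction performed by the two encoders can be sketched as a forward pass. This is a minimal illustration rather than the implementation used in the experiments: the logistic activation, the random placeholder weights, and the names `encode_stack` and `cnn_features` are assumptions introduced here. Only the layer sizes (1,000, 600, 300) come from the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode_stack(features, encoders):
    """Pass features through a list of (weights, bias) encoder layers."""
    hidden = features
    for weights, bias in encoders:
        hidden = sigmoid(hidden @ weights + bias)
    return hidden


rng = np.random.default_rng(0)
layer_sizes = [1000, 600, 300]  # CNN features -> first encoder -> second encoder

# Random placeholder weights; a trained SAE would supply learned values.
encoders = [
    (rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

cnn_features = rng.normal(size=(4, 1000))  # 4 hypothetical images
compressed = encode_stack(cnn_features, encoders)
print(compressed.shape)  # (4, 300)
```

Only the encoder halves are needed once training is finished; the decoders serve the reconstruction objective during training and are not part of the feature path to the classifier.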

The proposed SAE is trained for 1,500 epochs to minimize the reconstruction error (squared error), often referred to as the loss. The loss function combines the mean squared error (MSE) with L2 weight regularization and sparsity regularization. The L2 weight regularization keeps the weights of the model small, which reduces overfitting, leads to quicker optimization, and improves overall performance. A model that overfits memorizes the examples in the training data without learning the underlying concept, and therefore fails at the testing stage; the L2 weight regularization is used to address this issue. The sparsity regularization concentrates on choosing the input variables that best define the output. It is used with a sigma value of 4, which means that every feature (neuron) in the deep layer gets a regular output of 4 on the training examples. The sparsity proportion is the ratio of the number of zero-valued features to the number of all features in the matrix; in other words, it indicates what proportion of the weights are zero. The sparsity proportion is set to 0.1, which is suitable in this case. The workflow of the experiments with the two layers SAE is given in Fig. 3(a).
Fig. 3. Working flow of two ablation studies: (a) two layers autoencoder and (b) three layers autoencoder.
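The text names the three terms of the loss but does not give their exact form. The sketch below shows one common sparse autoencoder formulation and should be read as an assumption, not as the authors' loss: the sparsity term is taken to be a Kullback-Leibler divergence between a target proportion and the mean activation of each hidden unit, the value 4 is treated as the weight of that term, and the L2 coefficient `l2_coeff` is a hypothetical value that the text does not report.

```python
import numpy as np


def sparse_autoencoder_loss(x, x_hat, hidden, weights,
                            l2_coeff=0.001,          # hypothetical, not from the text
                            sparsity_coeff=4.0,      # assumed role of the value 4
                            sparsity_proportion=0.1):
    """MSE + L2 weight penalty + KL sparsity penalty (assumed formulation)."""
    # Reconstruction term: squared error per sample, averaged over samples
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L2 weight regularization: half the sum of squared weights
    l2 = 0.5 * sum(np.sum(w ** 2) for w in weights)
    # Sparsity term: KL divergence between the target proportion and the
    # mean activation of each hidden unit over the batch
    rho = sparsity_proportion
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return mse + l2_coeff * l2 + sparsity_coeff * kl


# Tiny illustrative batch: perfect reconstruction, zero weights
x = np.ones((5, 4))
zero_weights = [np.zeros((4, 3))]
on_target = sparse_autoencoder_loss(x, x, np.full((5, 3), 0.1), zero_weights)
too_dense = sparse_autoencoder_loss(x, x, np.full((5, 3), 0.5), zero_weights)
print(on_target, too_dense)
```

Under this formulation the penalty vanishes when the mean activation of every hidden unit equals the target proportion and grows as the activations become denser.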

Performance of Two Layers SAE for JPEG and TIFF Images
Performance graphs after the training of the first and second autoencoders for JPEG and TIFF images using AlexNet at the backend are shown in Fig. 4(a) and 4(c). The error is reduced effectively in both graphs. For JPEG images, the error falls between 10⁻¹ and 10⁻² at epoch 801 and is recorded as 0.018452 at the last training epoch. Similarly, for TIFF images, the error falls between 10⁻¹ and 10⁻² after 843 epochs. As Fig. 4(c) shows, the error is 0.017984 for the last autoencoder of the SAE in the training phase for TIFF images using AlexNet at the backend.
Performance graphs after the training of the first and second autoencoders for JPEG and TIFF images using VGG16 at the backend are shown in Fig. 4(b) and 4(d). For JPEG images, the error falls between 10⁻¹ and 10⁻² at epoch 421 and is recorded as 0.016136 at the last training epoch. Similarly, for TIFF images, the error falls between 10⁻¹ and 10⁻² after 1,061 epochs. As Fig. 4(d) shows, the error is 0.015586 for the last autoencoder of the SAE in the training phase.
Fig. 4. (a, b) present performance graphs of two layers SAE for JPEG images using AlexNet and VGG16 at the backend. (c, d) present performance graph of two layers SAE for TIFF images using AlexNet and VGG16 at the backend.
Fig. 5. Three layers stacked autoencoder architecture used for squeezing high-dimensional feature to low dimensions.
Experiments with Three Layers SAE
The experiments with three autoencoders are also performed with 1,500 iterations of training. Fig. 5 shows the three stacked autoencoders. The first encoder takes 1,000 input features and has 600 hidden units; the second autoencoder takes the 600 features extracted by the first autoencoder as its input. The third autoencoder takes 300 input features from the second autoencoder and has 100 hidden units, which produce the trained features that are finally used for classification. The autoencoders are trained to minimize the reconstruction error (squared error), often referred to as the loss. The MSE along with L2 weight regularization and sparsity regularization is applied as the loss function, which is used for fine-tuning the weights. The workflow of the experiments with the three layers SAE with AlexNet and VGG16 is shown in Fig. 3(b).
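The chaining of the three autoencoders can be made explicit with a small helper that lists the input and hidden size of each stage. This is a sketch under the assumption of fully connected encoders; `stack_plan` and `encoder_parameter_count` are hypothetical names introduced here, and only the sizes 1,000, 600, 300, and 100 come from the text.

```python
def stack_plan(input_dim, hidden_units):
    """List (input size, hidden size) for each autoencoder in the stack."""
    plan = []
    current = input_dim
    for units in hidden_units:
        plan.append((current, units))
        current = units  # the next autoencoder reads this hidden output
    return plan


def encoder_parameter_count(plan):
    """Count encoder weights plus biases, assuming fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in plan)


three_layer = stack_plan(1000, [600, 300, 100])
two_layer = stack_plan(1000, [600, 300])

print(three_layer)                           # [(1000, 600), (600, 300), (300, 100)]
print(two_layer[-1][1], three_layer[-1][1])  # 300 100
print(encoder_parameter_count(three_layer) - encoder_parameter_count(two_layer))  # 30100
```

The third stage adds relatively few encoder parameters but reduces the feature vector passed to the classifier from 300 to 100 dimensions.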

Performance of Three Layers SAE for JPEG and TIFF Images
Performance graphs after the training of the first, second, and third autoencoders for JPEG and TIFF images using AlexNet at the backend are shown in Fig. 6(a) and 6(c). The error is reduced effectively in both graphs. For JPEG images, the error falls between 10⁻³ and 10⁻⁴ at epoch 739 and is recorded as 0.00092761 at the last training epoch. Similarly, for TIFF images, the error falls between 10⁻³ and 10⁻⁴ after 1,200 epochs. As Fig. 6(c) shows, the error is reduced to 0.00091571 for the last autoencoder of the SAE in the training phase for TIFF images using AlexNet at the backend.
Performance graphs after the training of the first, second, and third autoencoders for JPEG and TIFF images using VGG16 at the backend are shown in Fig. 6(b) and 6(d). The error is reduced effectively in both graphs. For JPEG images, the error falls between 10⁻³ and 10⁻⁴ at epoch 719 and is recorded as 0.00087222 at the last training epoch. Similarly, for TIFF images, the error falls between 10⁻³ and 10⁻⁴ after 1,083 epochs. As Fig. 6(d) shows, the error is reduced to 0.00065517 for the last autoencoder of the SAE in the training phase for TIFF images using VGG16 at the backend.
Fig. 6. (a, b) present performance graph of three layers SAE for JPEG images using AlexNet and VGG16 at the backend. (c, d) present performance graph of three layers SAE for TIFF images using AlexNet and VGG16 at the backend.

Discussion and Comparison with State-of-the-Art
This section presents the results on the test sets of the datasets, obtained with the proposed trained models. Some of the test queries for the JPEG and TIFF compression formats are shown in Fig. 7 along with the ground truth and the class predicted by the proposed model.
Fig. 7. Results achieved using the proposed method for JPEG and TIFF compressed images.

Table 2 shows the accuracy, precision, and fallout scores achieved for the JPEG and TIFF image formats using the proposed technique with the two layers SAE and the three layers SAE, respectively. High accuracy and precision values indicate better classifier performance. A lower fallout indicates more precise results, and the fallout of the proposed method is low. Among all the experiments, the results with the two layers SAE and AlexNet features are dominant over all architectures, with high accuracy and precision values and low fallout. Table 3 compares the proposed method with previous works; the proposed method achieves a 3% increase for the JPEG and 2% for the TIFF image compression formats.

Table 2. The results achieved for experiments with two layers SAE and three layers SAE (unit: %)
Backbone model  Image compression  Two layers SAE                 Three layers SAE
                                   Accuracy  Precision  Fallout   Accuracy  Precision  Fallout
AlexNet         JPEG               95.9      95.2       7.2       94.3      94.4       8.5
AlexNet         TIFF               93.3      92.7       11.2      92.5      92.5       11.5
VGG16           JPEG               95.2      94.9       7.7       93.7      93.8       9.5
VGG16           TIFF               92.1      93.2       10.2      91.5      92.7       11
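The comparison between the two SAE depths can be checked directly from the values in Table 2. The sketch below copies the table into a dictionary and computes the accuracy difference between the two depths as well as the best configuration per image format. The helper names are hypothetical and the code adds no new results.

```python
# (accuracy, precision, fallout) in %, copied from Table 2
table2 = {
    ("AlexNet", "JPEG"): {"two": (95.9, 95.2, 7.2), "three": (94.3, 94.4, 8.5)},
    ("AlexNet", "TIFF"): {"two": (93.3, 92.7, 11.2), "three": (92.5, 92.5, 11.5)},
    ("VGG16", "JPEG"): {"two": (95.2, 94.9, 7.7), "three": (93.7, 93.8, 9.5)},
    ("VGG16", "TIFF"): {"two": (92.1, 93.2, 10.2), "three": (91.5, 92.7, 11.0)},
}


def accuracy_gain(results):
    """Two layers SAE accuracy minus three layers SAE accuracy (percentage points)."""
    return {key: round(r["two"][0] - r["three"][0], 1) for key, r in results.items()}


def best_by_accuracy(results, image_format):
    """Return (backbone, depth, accuracy) with the highest accuracy for a format."""
    candidates = [
        (backbone, depth, scores[0])
        for (backbone, fmt), depths in results.items() if fmt == image_format
        for depth, scores in depths.items()
    ]
    return max(candidates, key=lambda c: c[2])


print(accuracy_gain(table2))
print(best_by_accuracy(table2, "JPEG"))  # ('AlexNet', 'two', 95.9)
print(best_by_accuracy(table2, "TIFF"))  # ('AlexNet', 'two', 93.3)
```

In every row the two layers SAE is at least as accurate as the three layers SAE, and the AlexNet backbone gives the highest accuracy for both formats.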

Table 3. Comparison with state-of-the-art techniques
Method Accuracy (%)
Zhang et al. [3] 87.51 81.91
Bianchi and Piva [8] 40.84 -
Thing et al. [12] 79.72 -
Liu and Pun [38] 73.2
Li et al. [39] 92.38
He et al. [40] 89.76
Proposed method 95.9 93.3


Forgery detection is one of the challenging problems in the digital image era. The mainstream techniques for digital image forgery detection have two main limitations: accuracy and robustness across both JPEG and TIFF image formats, and time complexity. Some methods are not very accurate for both copy-move and spliced images, and most of the work has focused only on the JPEG image format. Digital image forgery detection tools should not be limited to a particular image format or a specific forgery type; they should be able to detect tampered regions in multiple formats, such as JPEG and TIFF images. Recently, deep learning has shown substantial progress in different image processing fields. In this paper, we proposed a deep learning-based image forgery detection method using multiple SAE structures and deep features extracted from pre-trained CNNs, namely the AlexNet and VGG16 models. The results with two autoencoders and AlexNet features are dominant over all other architectures, achieving 95.9% accuracy for JPEG images and 93.3% for TIFF images. We attribute this to two reasons. First, a deeper autoencoder alters the visual features, so the original representations are distorted. Second, AlexNet uses filters of different sizes, which extract local and global representations of the image, whereas VGG16 applies small filters throughout its network. Finally, the ESD achieved better performance for classifying forged and authentic images. Furthermore, our approach resolved the time complexity issue. We only performed the classification task; the approach can be extended to localize forged regions in the image and to achieve a lower fallout value, which is also a main issue in image forgery detection.


Not applicable.

Author’s Contributions

Conceptualization, SB. Funding acquisition, SWB. Investigation and methodology, SB, IUH, AU. Supervision, AA, SWB. Writing of the original draft, SB. Writing of the review and editing, SWB, AU. Validation, AU, SB. Visualization, IUH.


This work is supported by the Institute for Information & Communication Technology Planning & Evaluation (IITP) funded by the Ministry of Science and ICT (MSIT, Korea) (No. 1711126258, Development of data augmentation technology by using heterogeneous data and external data integration).

Competing Interests

The authors declare that they have no competing interests.

Author bios

Sumaira Bibi received the MS degree in computer science from International Islamic University Islamabad, Pakistan in 2020. Her major research focus is on forgery detection, image and video analytics, machine learning, and deep learning for multimedia understanding.

Almas Abbasi received the Ph.D. degree in computer science from the School of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. She is currently an Assistant Professor with the Department of Computer Science and Software Engineering, International Islamic University, Islamabad, Pakistan. Her current research interests include image processing, information hiding, and mobile applications.

Ijaz Ul Haq received the bachelor’s degree in computer science from Islamia College Peshawar, Peshawar, Pakistan in 2016. He is currently pursuing the M.S. leading to Ph.D. degree at Sejong University, Seoul, Republic of Korea, and serving as a Research Assistant and lab coordinator at the Intelligent Media Laboratory (IM Lab). He has published several papers in reputed peer-reviewed international journals and conferences including IEEE Transactions on Industrial Informatics, IEEE Access, Elsevier Future Generation Computer Systems, and Sensors. He is a student member of IEEE and provides professional review services for various reputed journals. His major research interests include video summarization, image and video analysis, image hashing, steganography, deep learning for multimedia understanding, and energy informatics.

Sung Wook Baik is a Full Professor at the Department of Digital Contents, Chief of the Sejong Industry-Academy Cooperation Foundation, and Head of the Intelligent Media Laboratory (IM Lab), Sejong University, Seoul, Korea. He received the Ph.D. degree in Information Technology Engineering from George Mason University, Fairfax, VA, in 1999. He worked at Datamat Systems Research Inc. as a senior scientist of the Intelligent Systems Group from 1997 to 2002. Since 2002, he has been serving as a faculty member at the Department of Digital Contents, Sejong University. His research interests include image processing, pattern recognition, video analytics, Big Data analysis, multimedia data processing, energy load forecasting, IoT, IIoT, and smart cities. His specific areas in image processing include image indexing and retrieval for various applications, and in video analytics his major focus is video summarization, action and activity recognition, anomaly detection, and CCTV data analysis. He has published over 100 papers in peer-reviewed international journals in the mentioned areas, mainly targeting top venues of these domains including IEEE IoTJ, TSMC-Systems, COMMAG, TIE, TII, Access, Elsevier Neurocomputing, FGCS, PRL, Springer MTAP, JOMS, RTIP, and MDPI Sensors. He is serving as a professional reviewer for several well-reputed journals and conferences including IEEE Transactions on Industrial Informatics, IEEE Transactions on Cybernetics, IEEE Access, and MDPI Sensors. He is a member of IEEE. He is involved in several projects including AI-Convergence Technologies and Multi-View Video Data Analysis for Smart Cities, Effective Energy Management Methods, Experts’ Education for Industrial Unmanned Aerial Vehicles, and Big Data Models Analysis, supported by the Korea Institute for Advancement of Technology and the Korea Research Foundation. He holds several Korean patents and one internationally accepted patent, with the main themes of disaster management, image retrieval, and speaker reliability measurement.

Amin Ullah received the Ph.D. degree in digital contents from Sejong University, South Korea. He is currently working as a Postdoc Researcher at the CoRIS Institute, Oregon State University, Corvallis 97331, Oregon, USA. His major research focus is on human action and activity recognition, sequence learning, image and video analytics, content-based indexing and retrieval, 3D point clouds, IoT and smart cities, and deep learning for multimedia understanding. He has published several papers in reputed peer-reviewed international journals and conferences including IEEE Transactions on Industrial Electronics, IEEE Transactions on Industrial Informatics, IEEE Transactions on Intelligent Transportation Systems, IEEE Internet of Things Journal, IEEE Access, Elsevier Future Generation Computer Systems, Elsevier Applied Soft Computing, International Journal of Intelligent Systems, Springer Multimedia Tools and Applications, Springer Mobile Networks and Applications, and IEEE Joint Conference on Neural Networks.


[1] Y. Lee, S. Rathore, J. H. Park, and J. H. Park, “A blockchain-based smart home gateway architecture for preventing data forgery,” Human-centric Computing and Information Sciences, vol. 10, article no. 9, 2020. https://doi.org/10.1186/s13673-020-0214-5
[2] C. Z. Yang, J. Ma, S. Wang, and A. W. C. Liew, “Preventing deepfake attacks on speaker authentication by dynamic lip movement analysis,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 1841-1854, 2020.
[3] Y. Zhang, J. Goh, L. L. Win, and V. L. Thing, “Image region forgery detection: a deep learning approach,” in Proceedings of the Singapore Cyber-Security Conference (SG-CRC), Singapore, 2016, pp. 1-11.
[4] A. Kashyap, R. S. Parmar, M. Agrawal, and H. Gupta, “An evaluation of digital image forgery detection approaches,” 2017 [Online]. Available: https://arxiv.org/abs/1703.09968.
[5] J. Zhou, J. Ni, and Y. Rao, “Block-based convolutional neural network for image forgery detection,” in Digital Forensic and Watermarking. Cham, Switzerland: Springer, 2017, pp. 65-76.
[6] N. B. A. Warif, A. W. A. Wahab, M. Y. I. Idris, R. Ramli, R. Salleh, S. Shamshirband, and K. K. R. Choo, “Copy-move forgery detection: survey, challenges and future directions,” Journal of Network and Computer Applications, vol. 75, pp. 259-278, 2016.
[7] X. Y. Wang, S. Li, Y. N. Liu, Y. Niu, H. Y. Yang, and Z. L. Zhou, “A new keypoint-based copy-move forgery detection for small smooth regions,” Multimedia Tools and Applications, vol. 76, no. 22, pp. 23353-23382, 2017.
[8] T. Bianchi and A. Piva, “Image forgery localization via block-grained analysis of JPEG artifacts,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 1003-1017, 2012.
[9] I. C. Chang, J. C. Yu, and C. C. Chang, “A forgery detection algorithm for exemplar-based inpainting images using multi-region relation,” Image and Vision Computing, vol. 31, no. 1, pp. 57-71, 2013.
[10] Y. L. Chen and C. T. Hsu, “Detecting recompression of JPEG images via periodicity analysis of compression artifacts for tampering detection,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 396-406, 2011.
[11] X. H. Li, Y. Q. Zhao, M. Liao, F. Y. Shih, and Y. Q. Shi, “Detection of tampered region for JPEG images by using mode-based first digit features,” EURASIP Journal on Advances in Signal Processing, vol. 2012, article no. 190, 2012. https://doi.org/10.1186/1687-6180-2012-190
[12] V. L. Thing, Y. Chen, and C. Cheh, “An improved double compression detection method for JPEG image forensics,” in Proceedings of 2012 IEEE International Symposium on Multimedia, Irvine, CA, 2012, pp. 290-297.
[13] W. Wang, J. Dong, and T. Tan, “Exploring DCT coefficient quantization effects for local tampering detection,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 10, pp. 1653-1666, 2014.
[14] F. Zach, C. Riess, and E. Angelopoulou, “Automated image forgery detection through classification of JPEG ghosts,” in Pattern Recognition. Heidelberg, Germany: Springer, 2012, pp. 185-194.
[15] K. Mahmoud and A. H. A. Al-Rukab, “Moment based copy move forgery detection methods,” International Journal of Computer Science and Information Security, vol. 14, no. 7, pp. 28-35, 2016.
[16] X. Y. Wang, Y. N. Liu, H. Xu, P. Wang, and H. Y. Yang, “Robust copy–move forgery detection using quaternion exponent moments,” Pattern Analysis and Applications, vol. 21, no. 2, pp. 451-467, 2018.
[17] A. Kuznetsov and V. Myasnikov, “A new copy-move forgery detection algorithm using image preprocessing procedure,” Procedia Engineering, vol. 201, pp. 436-444, 2017.
[18] F. Zhao, W. Shi, B. Qin, and B. Liang, “Image forgery detection using segmentation and swarm intelligent algorithm,” Wuhan University Journal of Natural Sciences, vol. 22, no. 2, pp. 141-148, 2017.
[19] P. Niu, C. Wang, W. Chen, H. Yang, and X. Wang, “Fast and effective keypoint-based image copy-move forgery detection using complex-valued moment invariants,” Journal of Visual Communication and Image Representation, vol. 77, article no. 103068, 2021. https://doi.org/10.1016/j.jvcir.2021.103068
[20] H. A. Jalab, T. Subramaniam, R. W. Ibrahim, H. Kahtan, and N. F. M. Noor, “New texture descriptor based on modified fractional entropy for digital image splicing forgery detection,” Entropy, vol. 21, no. 4, article no. 371, 2019. https://doi.org/10.3390/e21040371
[21] H. Y. Huang and A. J. Ciou, “Copy-move forgery detection for image forensics using the superpixel segmentation and the Helmert transformation,” EURASIP Journal on Image and Video Processing, vol. 2019, article no. 68, 2019. https://doi.org/10.1186/s13640-019-0469-9
[22] J. Bunk, J. H. Bappy, T. M. Mohammed, L. Nataraj, A. Flenner, B. S. Manjunath, S. Chandrasekaran, A. K. Roy-Chowdhury, and L. Peterson, “Detection and localization of image forgeries using resampling features and deep learning,” in Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, 2017, pp. 1881-1889.
[23] D. H. Kim and H. Y. Lee, “Image manipulation detection using convolutional neural network,” International Journal of Applied Engineering Research, vol. 12, no. 21, pp. 11640-11646, 2017.
[24] M. J. Khan, A. Yousaf, A. Abbas, and K. Khurshid, “Deep learning for automated forgery detection in hyperspectral document images,” Journal of Electronic Imaging, vol. 27, no. 5, article no. 053001, 2018. https://doi.org/10.1117/1.JEI.27.5.053001
[25] Y. Liu, Q. Guan, and X. Zhao, “Copy-move forgery detection based on convolutional kernel network,” Multimedia Tools and Applications, vol. 77, no. 14, pp. 18269-18293, 2018.
[26] D. Cozzolino, J. Thies, A. Rossler, C. Riess, M. Nießner, and L. Verdoliva, “Forensictransfer: weakly-supervised domain adaptation for forgery detection,” 2018 [Online]. Available: https://arxiv.org/abs/1812.02510.
[27] H. Khalid and S. S. Woo, “OC-FakeDect: classifying deepfakes using one-class variational autoencoder,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, 2020, pp. 656-657.
[28] F. Marra, D. Gragnaniello, L. Verdoliva, and G. Poggi, “A full-image full-resolution end-to-end-trainable CNN framework for image forgery detection,” IEEE Access, vol. 8, pp. 133488-133502, 2020.
[29] K. B. Meena and V. Tyagi, “Image forgery detection: survey and future directions,” in Data, Engineering and Applications. Singapore: Springer, 2019, pp. 163-194.
[30] S. Walia and K. Kumar, “Digital image forgery detection: a systematic scrutiny,” Australian Journal of Forensic Sciences, vol. 51, no. 5, pp. 488-526, 2019.
[31] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face x-ray for more general face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 5001-5009.
[32] A. Dixit and S. Bag, “A fast technique to detect copy-move image forgery with reflection and non-affine transformation attacks,” Expert Systems with Applications, vol. 182, article no. 115282, 2021. https://doi.org/10.1016/j.eswa.2021.115282
[33] M. N. Abbas, M. S. Ansari, M. N. Asghar, N. Kanwal, T. O'Neill, and B. Lee, “Lightweight deep learning model for detection of copy-move image forgery with post-processed attacks,” in Proceedings of 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl'any, Slovakia, 2021, pp. 000125-000130.
[34] A. H. Saber, M. A. Khanl, and B. G. Mejbel, “A survey on image forgery detection using different forensic approaches,” Advances in Science, Technology and Engineering Systems Journal, vol. 5, no. 3, pp. 361-370, 2020.
[35] J. Yang, Z. Liang, Y. Gan, and J. Zhong, “A novel copy-move forgery detection algorithm via two-stage filtering,” Digital Signal Processing, vol. 113, article no. 103032, 2021. https://doi.org/10.1016/j.dsp.2021.103032
[36] S. U. Khan, T. Hussain, A. Ullah, and S. W. Baik, “Deep-ReID: deep features and autoencoder assisted image patching strategy for person re-identification in smart cities surveillance,” Multimedia Tools and Applications, 2021. https://doi.org/10.1007/s11042-020-10145-8
[37] A. S. Ashour, Y. Guo, A. R. Hawas, and G. Xu, “Ensemble of subspace discriminant classifiers for schistosomal liver fibrosis staging in mice microscopic images,” Health Information Science and Systems, vol. 6, article no. 21, 2018. https://doi.org/10.1007/s13755-018-0059-8
[38] B. Liu and C. M. Pun, “Locating splicing forgery by fully convolutional networks and conditional random field,” Signal Processing: Image Communication, vol. 66, pp. 103-112, 2018.
[39] C. Li, Q. Ma, L. Xiao, M. Li, and A. Zhang, “Image splicing detection based on Markov features in QDCT domain,” Neurocomputing, vol. 228, pp. 29-36, 2017.
[40] Z. He, W. Lu, W. Sun, and J. Huang, “Digital image splicing detection based on Markov features in DCT and DWT domain,” Pattern Recognition, vol. 45, no. 12, pp. 4292-4299, 2012.

About this article
  • Received: 18 February 2021
  • Accepted: 18 July 2021
  • Published: 15 August 2021