Twitter Spam Detection via Bilinear Autoencoding Reconstruction Error
• Qian He1, Sun Zhang2, Bo Li2, and Chunyong Yin2,*

Human-centric Computing and Information Sciences volume 12, Article number: 27 (2022)
https://doi.org/10.22967/HCIS.2022.12.027

Abstract

Recent years have witnessed the development of online social networks, which provide convenient services for instant communication, social interaction, and information dissemination. Billions of messages are transmitted through online social websites and mobile social applications every day, and spam messages are also spread rapidly by illegal accounts for malicious purposes, threatening the quality and security of social networks. Therefore, we propose an auto-encoding bilinear classifier to achieve Twitter spam detection based on reconstruction error. Firstly, the raw hand-crafted features are considered as the global information, and multiple convolution kernels are employed to extract the local information as additional features. Then, the two types of features are utilized to learn the reconstructed representation in an auto-encoder, which is expected to capture only the distribution of normal data and to correctly reconstruct the original sample if it matches the normal pattern. Consequently, the reconstruction error is a feasible indicator for spam detection. A bilinear mapping is exploited to capture the pairwise difference, and a feed forward network is utilized to identify the spam data based on the evidence from the bilinear mapping. A cosine reconstruction loss is designed to minimize the reconstruction error of the normal pattern while maximizing that of the spam pattern. Qualitative and quantitative experiments have been conducted on two real datasets. The experimental results demonstrate that our proposed model achieves competitive performance for Twitter spam detection.

Keywords

Bilinear Mapping, Reconstruction Error, Spam Detection, Social Network

Introduction

Recent years have witnessed the surge of online social networks, which have attracted a large number of registered users with convenient services, flexible access, and rich content. Nowadays, the social network has become not only a platform for social interaction, but also a new medium of information dissemination. However, the rapid dissemination of messages has also brought new challenges to social network security. Online users may receive numerous messages each day in an active or passive manner. It is difficult to judge the safety of received messages, which could be full of advertisements, frauds, and other spam information. Malicious and fake messages are usually spread from illegal accounts that are automatically registered by scripts and programs for illegal purposes. Such automation scripts are also known as social bots [1], which randomly send friend requests to massive numbers of users. If the victim accepts the request, social bots attempt to start chat sessions to collect private personal information or to send spam messages with URLs directed to external malicious websites. A study of 13 million spam emails also found that more than 100 thousand samples contained malicious attachments, and about 1.4 million samples had malicious links [2].
Security threats have already been noticed by Twitter, and filtering rules are predefined on Twitter servers for spam detection. For instance, an account would be suspected as malicious by the servers if it sends a large number of friend requests, duplicate tweets, or link-only tweets [7]. Although the filtering rules can detect spam tweets quickly, these hand-crafted rules always lag behind in the game with attackers and have a higher missed-alarm rate. For example, a spam account usually follows a large number of users randomly, while it has few followers. Rule-based methods can utilize this feature to identify spam accounts. However, attackers can disguise their accounts by buying followers from third-party service providers. Alazab and Broadhurst [2] also pointed out that the growth of the Internet can increase the sophistication of malicious software and tools. Therefore, better feature extraction and automatic detection methods are required for more complex spam patterns.
Current studies are based on different machine learning or deep learning models for automatic spam detection. For example, Cheng et al. [8] employed logistic regression to detect malicious users on Twitter, and found that malicious users can be recognized according to the length, complexity, and diversity of screen names. Washha et al. [9] proposed an unsupervised collective-based framework for real-time spam tweet detection, which addressed the problem of concept drift. An approach based on Markov random fields was proposed in [10] for the social spam evolution problem, in which biased and inaccurate prior predictions were tolerated by the probabilistic graphical model for higher recall.
Machine learning-based methods have improved the performance of spam detection effectively. However, these methods require expert knowledge and struggle with massive data. Decision trees are usually suitable for discrete features, but there are many continuous statistics in Twitter data. Although continuous features can be transformed into discrete features by discretization methods, the transformation might cause information loss. Logistic regression learns parameters by maximum likelihood estimation, which is appropriate for continuous features but sensitive to the correlation between features. The support vector machine (SVM) is widely utilized in text processing by minimizing the structural risk, but it is only fit for small-scale datasets. The decision function of an SVM is determined by a few support vectors, which makes it difficult to handle massive samples.
Deep learning methods are known for automatic feature extraction and parameter optimization from massive data. Sicato et al. [11] made a comprehensive analysis to evaluate the impact of deep learning on security issues. Common network architectures of deep learning include the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), and auto-encoder. A DNN achieves a multi-layer nonlinear mapping of original features through stacked fully connected layers and is widely applied in classification tasks. A CNN is appropriate for image, video, and other multimedia processing; it extracts high-level local spatial information by performing convolution operations on original features [12]. An RNN extracts temporal features and memorizes sequential information in hidden states, which makes it applicable to audio, text, and other sequential data. The auto-encoder exploits an encoder-decoder structure to compress high-dimensional features into a low-dimensional space for information refining, or to map low-dimensional features to a high-dimensional space for capturing abundant information. Recent years have witnessed their applications in various fields. In commercial fields, Asghar et al. [13] developed a fuzzy-based eSystem with DNN to provide a feasible solution for customer feedback and satisfaction. In healthcare fields, Lee et al. [14] proposed a multi-class classification model with neural networks to monitor the events from healthcare environments. In security fields, Rathore and Park [15] proposed a secure learning scheme for cyber-physical systems, where deep learning methods were operated in a decentralized and secure manner.
Based on the aforementioned analysis and comparison, the auto-encoding bilinear classifier (ABC) is proposed for spam detection in this paper. Inspired by the work of Tan et al. [16], who employed a sliding window to perform local feature extraction, we extract the local features with multiple one-dimensional convolution kernels. Original features are considered as the global information and are input into the auto-encoder together with the local features to learn the reconstructed representation. The auto-encoder is expected to reconstruct normal data correctly while generating higher reconstruction errors for spam data. Consequently, the difference between the original and reconstructed representations becomes an indicator for spam detection. Then, the bilinear mapping is introduced to capture the pairwise difference between the reconstructed and original representations. Lastly, a feed forward network composed of stacked fully connected layers is utilized to identify spam based on the evidence from the bilinear mapping. The entire model is optimized by a cosine reconstruction loss that increases the reconstruction error for the spam pattern while reducing that for the normal pattern. Two datasets collected from Twitter are employed for evaluation and analysis. The first one was collected by Lee et al. [17] and is named Social Honeypot (https://infolab.tamu.edu/data/#social-honeypot-dataset). The other one is 6 Million Spam Tweets (http://nsclab.org/nsclab/resources/) published by Chen et al. [5]. In addition, we have constructed novel user-based features for the Social Honeypot dataset. The empirical results demonstrate that the extended features can improve the performance of both deep learning and machine learning methods. The main contributions are three-fold:

We propose to employ the reconstruction error as the indicator for spam detection, and design the auto-encoding bilinear classifier to capture the pair-wise error between the reconstructed representation and raw inputs. Empirical results on two Twitter spam datasets demonstrate its competitive performance.

Inspired by the work of [16], we propose to extract local features by convolution kernels with different sizes as a supplement to global features. Quantitative experiments and an ablation study confirm its positive impact on spam detection.

We design a cosine reconstruction loss for model optimization, which can increase the reconstruction error between the modeling pattern and spam pattern for easier identification.

The remainder of this paper is organized as follows. Common features and models used in spam detection are reviewed and summarized in Section 2. The extended features of the Honeypot dataset and our improved methods are introduced in Section 3. Section 4 shows the experimental results and corresponding analysis on two datasets. The conclusion and future work are discussed in Section 5.

Related Work

Two hundred million users have shared about 400 million tweets per day on Twitter. A tweet is defined as a short message that can be easily understood by users. The content of tweets can contain arbitrary characters, in which “#words” is a hashtag utilized to aggregate tweets on specific topics. “@username” is a symbol known as a mention, which can directly push messages to a specific user even if there is no following or friend relationship between the users. Tweets also contain URLs that may direct users to external malicious websites, such as phishing, Trojan horse, and advertising sites.
A spam tweet is defined as a tweet that contains URLs directed to malware downloads, phishing, drug trafficking, fraud, and other external malicious websites [6]. From this point of view, spam detection can be implemented by judging whether malicious URLs are in the tweet or not. Some online social websites send a warning when users click a link in a tweet and are redirected to an external website. Another common method is to create a blacklist of known malicious URLs and filter each tweet. However, attackers can modify the characters of URLs in the tweet through short URL services and exploit a third-party website for redirection. These short URLs are usually composed of meaningless characters and can be changed frequently. Consequently, it is difficult to filter the changeable malicious URLs with blacklists.
Given a set of $k$ users $\{u_1,u_2,\ldots,u_k\}$, each user $u_i$ has a feature set $F_{u_i}=\{f_1,f_2,\ldots,f_m\}$. These features are extracted from the accounts, contents, network structures, etc. The problem of spam detection in social networks is defined as training a classifier $c$ to predict the label (spammer or legitimate) of a user from the feature set. Conventional detection methods for spam emails usually focus on the content of emails. Although attackers can be identified, they are not banned from sending spam emails to others. Different from spam email detection, spam detection for social networks not only detects spam messages, but also identifies and removes spam accounts.
Most existing methods for spam detection are based on various machine learning algorithms to build a classifier with hand-crafted statistical features. These features can be grouped into five categories as described in Table 1.

Table 1. Common hand-crafted statistical features for spam detection

| Category | Description | Examples |
| --- | --- | --- |
| user-based | The basic attributes of registered accounts, may be restricted by privacy policy. | Screen name, number of followings and followers. |
| content-based | Features about historical messages of users, including semantic and grammatical features. | News, topics, emotions, structure, theme, symbol usage, symbol meaning. |
| network-based | The attributes extracted from the relationship network between users. | Connectivity relations, structure metrics, in-degree, out-degree, clustering coefficient, centrality. |
| activity-based | The pattern of individual or network behaviors. | The number, frequency, and regularity of publishing messages, the interactions with friends. |
| auxiliary | Other characteristics exploited in spam detection. | The change of content, network, activity features in time scale, the comments from special items or users. |

Masood et al. [18] also presented a survey of the features applied in spam detection. They mentioned that the depth of the conversation tree, composed of comments from other users, can be utilized for fake news detection [19], because the comments from other users can reflect the quality and authenticity of the news, with a motivation similar to crowd-sourcing. Hakak et al. [20] also employed the location, events, laws, and other context-related features to recognize fake news. Therefore, the features that can be exploited in spam detection are not limited to the content in Table 1, and feature extraction still plays a crucial role in our study.
In recent years, deep neural networks with diverse architectures have been applied to spam detection and security issues. For example, Makkar and Kumar [21] presented a web cognitive spammer framework based on deep learning, in which a long short-term memory (LSTM) network was trained with link-based features and a neural network was trained with content-based features. Tang et al. [22] noted that behavior features are effective in spam review detection, but that tracking and collecting them takes time, which makes such methods vulnerable to cold-start spam reviews. They proposed to synthesize behavior features from accessible features by adversarial training. An integrated approach of CNN and LSTM was proposed for text spam detection in short message service (SMS) [23], which showed the excellent performance of deep learning methods on the spam detection task. Liu et al. [24] proposed a complex probabilistic graph method to capture the relation between opinion reviews. The features of reviewers and products were incorporated to learn the embedding representation of related reviews. Park et al. [25] proposed a scalable network architecture for smart cities, where deep learning technologies provided the architecture with cognitive ability. Sharma et al. [26] developed an efficient and accurate architecture for smart homes, in which a packet detection method was exploited to prevent security threats. Similarly, Sicato et al. [27] paid attention to the threat from malware attacks in smart homes, and proposed an intrusion detection method as the defense mechanism. Since spam detection is a hot issue in the community, we briefly summarize some representative methods from 2018 to 2022 [6, 7, 9, 10, 16, 20-22, 24, 28-34], which depend heavily on different categories of features, as shown in Table 2.

Table 2. Summary of current methods for spam detection from 2018 to 2022

| Study | Year | Basic model | Feature category | Limitations |
| --- | --- | --- | --- | --- |
| Fazil and Abulaish [6] | 2018 | Random forest, decision tree, Bayesian network | user-based, content-based, network-based, activity-based | heavily relying on features, limited scalability |
| Madisetty and Desarkar [28] | 2018 | CNN | user-based, content-based | low recall and F1 score |
| Inuwa-Dutse et al. [29] | 2018 | Latent semantic analysis | user-based, content-based, network-based | low efficiency caused by concept retrieval |
| Washha et al. [9] | 2019 | Non-negative matrix factorization | user-based, content-based | without control of cached data, high space occupation |
| Liu et al. [24] | 2019 | Bayesian network | user-based, content-based | vulnerable for modality missing |
| Kabakus and Kara [30] | 2019 | Naïve Bayes (NB) | user-based, content-based | potential hypothesis of feature independence |
| El-Mawass et al. [10] | 2020 | Markov random field (MRF) | user-based, content-based, activity-based | long execution time, high complexity |
| Tan et al. [16] | 2020 | Decision tree | content-based | sensitive to the size of sliding window |
| Makkar and Kumar [21] | 2020 | LSTM | user-based, activity-based | high space occupation caused by ensemble models |
| Tang et al. [22] | 2020 | Generative adversarial network | activity-based | unstable training process, part of access-restricted features |
| Hakak et al. [20] | 2020 | CNN, LSTM | content-based | large number of learnable parameters |
| Seyler et al. [7] | 2021 | CNN, LSTM | content-based | requiring domain knowledge about spam words |
| Ahmad et al. [31] | 2021 | SVM | user-based, content-based, network-based, activity-based | sensitive to kernel selection |
| Abkenar et al. [32] | 2021 | Random forest | user-based, content-based | only evaluated on one dataset |
| Kiruthika Devi and Sathish Kumar [33] | 2022 | CNN | user-based, content-based | low performance compared with baselines |
| Liu et al. [34] | 2022 | CNN, LSTM | content-based | neglecting discrete and user-based features |
| Proposed model (ABC) | 2022 | CNN, auto-encoder | user-based | requiring annotated labels, long execution time |

Auto-Encoding Bilinear Classifier

Motivation
Feature extraction is well known as an indispensable part of spam detection, and it has a direct impact on the final performance of any model. As summarized in Table 2, existing methods are mostly based on various hand-crafted features, which are extracted with prior knowledge and expert experience. With the rise of deep learning, automatic representation learning has also come to dominate diverse fields. Motivated by the work in [16], we treat raw hand-crafted features as the global information, and employ convolution kernels with different sizes to extract the local information as supplementary features. This differs from the process in [16], where a sliding window with a fixed size was exploited for local feature extraction.
In addition, spam detection is mostly considered as a classification task, and the classifiers are applied in an end-to-end manner, where the intermediate process is hard to understand. We propose to identify the spam data according to reconstruction errors, and employ a bilinear mapping layer to capture the pair-wise error between the reconstructed representation and raw inputs. To increase the reconstruction error for easier identification, we employ the auto-encoder architecture to learn the normal pattern, and design the cosine reconstruction loss to encourage the model pattern to be close to the normal sample pattern and far from the spam one.

Spam Detection Framework
As shown in Fig. 1, a common framework of spam detection consists of four modules: data collection, feature representation, spam detection, and result evaluation. Data collection is the most basic and vital part of the whole framework. Twitter allows access to tweets and some account-related attributes through the official Twitter API. The requested data are stored in JavaScript Object Notation (JSON) format, and the Twitter API provides four objects: Tweets, Users, Entities, and Places. Each object has several related attributes. Tweets contain the basic information of each tweet, such as the creation date and the numbers of retweets and likes. Users are attributes related to the account, including the numbers of followings, followers, etc. Entities provide the features related to the contents of each tweet and other information extracted from the content. Places refer to the location of each account, but this attribute can be modified easily. Some users might be unwilling to present their real location for security reasons, so the location is not usually used for spam detection, although it is valuable for other services. For example, Zheng et al. [35] utilized Twitter location data to predict future locations and recommend points of interest (POI). Twitter accounts are usually public, which means most information related to users can be accessed through the Twitter API, but some users hide their tweets or accounts; these are called protected accounts. For protected accounts, it is difficult to collect detailed information about Tweets, but some attributes of Users can still be accessed. Considering the limitation of protected accounts, user-based features are selected for spam detection in our work.

Fig. 1. Common framework of spam detection.

Feature Construction
The Social Honeypot dataset was collected by Lee et al. [17] from December 30, 2009 to August 2, 2010. The samples are manually labeled as polluter or legitimate. The records of each account are stored in three files: “content_following.txt,” “content_tweets.txt,” and “content.txt.” In our experiments, we select all features in “content.txt” and construct new features as a supplement to the user-based features. The original features are: user id, account creation date, collection date, the numbers of followings, followers, and tweets, and the lengths of screen names and profiles. Four extended features are constructed for better spam detection. Note that an integer is added to each denominator to avoid invalid division, which also conforms to the Dirichlet uncertainty assumption. The additional extended features are as follows.
(1) Age: The age of each account is the number of days from its creation date to January 1, 2019.
(2) Reputation: It is the ratio of the number of followers to the total number of followings and followers:

$\text{Reputation} = \frac{\#\text{follower}+1}{\#\text{follower}+\#\text{following}+2}$ (1)

(3) FF-ratio: The ratio of followings to followers is exploited to identify spammers, as discussed in previous sections. Spammers usually follow a large number of users randomly, but they rarely obtain follows in return. Therefore, this characteristic is selected to evaluate the authenticity of the accounts.

$\text{FF-ratio} = \frac{\#\text{following}+1}{\#\text{follower}+1}$ (2)

(4) Freq-tweet: It refers to the frequency of daily tweet publishing and can be obtained from the number of tweets and the date interval. A spammer usually publishes a large number of duplicated tweets, so the frequency will be much higher than that of legitimate accounts.

$\text{Freq-tweet} = \frac{\#\text{tweet}}{\text{collect\_date}-\text{create\_date}}$ (3)
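The four extended features can be sketched in a few lines of Python. This is only an illustration of Equations (1)-(3) and the age definition. The function name, argument names, and the sample account are hypothetical, and the guard against a zero-day interval in Freq-tweet is an assumption added here, since Equation (3) is written without a smoothing constant.

```python
from datetime import date


def extended_features(created, collected, followings, followers, tweets,
                      reference=date(2019, 1, 1)):
    """Compute the four extended user-based features for one account."""
    # Age: days from the account creation date to the reference date.
    age = (reference - created).days
    # Eq. (1): share of followers among followers + followings (smoothed).
    reputation = (followers + 1) / (followers + followings + 2)
    # Eq. (2): followings per follower (smoothed).
    ff_ratio = (followings + 1) / (followers + 1)
    # Eq. (3): tweets per day between creation and collection.
    # Assumption: clamp the interval to at least one day to avoid 0-division.
    active_days = max((collected - created).days, 1)
    freq_tweet = tweets / active_days
    return {"age": age, "reputation": reputation,
            "ff_ratio": ff_ratio, "freq_tweet": freq_tweet}


# Made-up account: follows many users, has few followers, tweets often.
example = extended_features(created=date(2009, 1, 1),
                            collected=date(2010, 1, 1),
                            followings=999, followers=99, tweets=730)
print(example)
```

For this made-up account the FF-ratio is large and the reputation is small, which is the direction the text associates with spammers.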

Fig. 2. Correlation coefficients of pair-wise features.

The Social Honeypot dataset is extended with the four constructed features, and the correlations of pair-wise features are visualized in Fig. 2. The correlation coefficient ranges from -1 to +1. A positive correlation coefficient means two features are positively correlated, so an increase in one feature is associated with an increase in the other. In contrast, a negative correlation coefficient indicates that an increase in one feature is associated with a decrease in the other. As shown in Fig. 2, the features with higher absolute correlation scores with the target label (“is_polluter”) are the number of followings, the length of screen names, the length of profile texts, age, reputation, and FF-ratio. Age and reputation have the highest absolute correlation scores, and both are negative. The number of followings is significantly correlated with the target label, while the number of followers is not. This phenomenon suggests that spammers could buy followers from third-party websites or follow each other to increase the number of followers. According to the correlations between features, the profile of a Twitter spammer can be summarized as follows: the account is usually newly registered, uses a long account name and profile description (e.g., some meaningless or duplicate characters), and randomly follows a large number of users.
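As a rough illustration of how such a coefficient is obtained, the sketch below computes the Pearson correlation between one feature and the target label. The paper does not state which correlation coefficient is used, so Pearson is an assumption here, and the ages and labels are made-up toy values rather than records from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: younger accounts are labeled as polluters (1), older ones as 0.
age = [30, 45, 60, 900, 1200, 1500]
is_polluter = [1, 1, 1, 0, 0, 0]
r_age = pearson(age, is_polluter)
print(round(r_age, 3))
```

A negative value, as in this toy case, matches the reading in the text that a lower age goes together with the polluter label.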

Classifier Design
The auto-encoding bilinear classifier is proposed for spam detection and the architecture of the classifier is shown in Fig. 3. It is composed of local feature extraction, auto-encoder, bilinear mapping, and feed forward network.

Fig. 3. Architecture of the auto-encoding bilinear classifier.

Fig. 4. Distinct local features are extracted by multiple convolution kernels with different sizes.

Tan et al. [16] utilized a sliding window to extract local information after feature construction. The sliding window is essentially a special one-dimensional convolution kernel, which is usually applied to sequential data. As shown in Fig. 4, convolution kernels with different sizes can extract distinct local information, so multiple convolution kernels are employed for local feature extraction.
Given the training samples $X=\{x_1,x_2,\ldots,x_n\}\in\mathbb{R}^{m\times n}$ and target labels $Y=\{y_1,y_2,\ldots,y_n\}\in\mathbb{R}^n$, each sample $x_i\in\mathbb{R}^m$ has $m$ features, and $y_i\in\{0,1\}$ indicates that it is a binary classification problem. In our experiments, the stride length and channel number are fixed to 1, and the padding size is zero. $k$ convolution kernels $\{conv_1,conv_2,\ldots,conv_k\}$ with corresponding sizes $\{l_1,l_2,\ldots,l_k\}$ are utilized for local feature extraction. For convolution kernel $conv_i$, the dimension of the extracted local feature is $m-l_i+1$.
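The output dimension $m-l_i+1$ can be checked with a small pure-Python sketch of an unpadded, stride-1, single-channel one-dimensional convolution. The kernel weights below are fixed placeholders (in the model they are learned), and the choice of $m=12$ with five kernels of size 2 and five of size 3 is an assumption taken from the example configuration listed later in Table 4.

```python
def conv1d(x, kernel, bias=0.0):
    """Unpadded, stride-1, single-channel 1-D convolution.

    As in deep learning frameworks, the kernel is slid over x without
    flipping. The output has len(x) - len(kernel) + 1 entries.
    """
    size = len(kernel)
    return [sum(w * v for w, v in zip(kernel, x[i:i + size])) + bias
            for i in range(len(x) - size + 1)]


def local_features(x, kernels):
    """Concatenate the outputs of all kernels into one local feature vector."""
    out = []
    for kernel in kernels:
        out.extend(conv1d(x, kernel))
    return out


m = 12
x = [float(i) for i in range(m)]            # one sample with m raw features
kernels = [[0.5, 0.5]] * 5 + [[1 / 3] * 3] * 5  # placeholder weights
local = local_features(x, kernels)
combined = x + local                        # global + local features
print(len(local), len(combined))
```

Under these assumptions the auto-encoder input has 12 + 5×11 + 5×10 = 117 dimensions.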
All the outputs of the multiple convolution kernels are concatenated as the local feature representation, which is input into the auto-encoder together with the original features. The auto-encoder is expected to model only the distribution of normal samples, which means it can correctly reconstruct normal samples while having a higher reconstruction error for spam samples. The bilinear mapping is utilized to project the original and reconstructed representations into the same vector space for better pair-wise comparison. A new vector representation carrying error information is generated from the bilinear mapping layer for further spam detection. Then, the feed forward network, composed of stacked fully connected layers, is employed to identify spam data. Since this is a binary classification problem, the output is activated by the sigmoid function.
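The text does not spell out the bilinear mapping, so the following sketch assumes the standard form used by common deep learning frameworks, where each output unit $k$ is $z_k = x^\top W_k x' + b_k$ for the original vector $x$ and the reconstructed vector $x'$. The weights here are random placeholders for illustration and are not the trained parameters.

```python
import random


def bilinear(x1, x2, weights, bias):
    """Bilinear map: out[k] = sum_ij x1[i] * weights[k][i][j] * x2[j] + bias[k]."""
    out = []
    for w_k, b_k in zip(weights, bias):
        total = b_k
        for i, a in enumerate(x1):
            for j, b in enumerate(x2):
                total += a * w_k[i][j] * b
        out.append(total)
    return out


random.seed(0)
dim, units = 4, 3                      # toy sizes, not the paper's settings
weights = [[[random.uniform(-1, 1) for _ in range(dim)]
            for _ in range(dim)] for _ in range(units)]
bias = [0.0] * units

x = [0.2, 0.5, 0.1, 0.9]               # original representation (made up)
x_rec = [0.2, 0.4, 0.1, 0.7]           # reconstructed representation (made up)
evidence = bilinear(x, x_rec, weights, bias)
print(len(evidence))
```

Every pair of coordinates $(x_i, x'_j)$ contributes to each output unit, which is one way to read the "pair-wise comparison" described above. The resulting vector is what a feed forward network would then classify.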

$\text{recons\_loss}=y_i\cos(x_i,x_{i'})-(1-y_i)\cos(x_i,x_{i'})$ (4)

The cosine reconstruction loss is designed for the optimization of the auto-encoder as shown in Equation (4), where $\cos(x_i,x_{i'})$ denotes the cosine similarity between the original representation $x_i$ and the reconstructed representation $x_{i'}$. The auto-encoder is expected to correctly reconstruct the input sample $x_i$ when the sample is normal ($y_i=0$), so minimizing the loss maximizes the cosine similarity. In contrast, a low cosine similarity is expected when $y_i=1$. The cross entropy is employed as the discrimination loss as follows:

$\text{discri\_loss}=-y_i\log\hat{y}_i-(1-y_i)\log(1-\hat{y}_i)$ (5)

where $\hat y_i$ is the prediction result. Finally, the objective of the whole model is to minimize the loss:

$\arg\min \frac{1}{n}\sum_{i=1}^{n} y_i\left(\cos(x_i,x_{i'})-\log\hat{y}_i\right)-(1-y_i)\left(\cos(x_i,x_{i'})+\log(1-\hat{y}_i)\right)$ (6)
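To make the sign convention concrete, the following sketch evaluates Equations (4)-(6) on a few made-up samples. The function names mirror the equation labels. The small constant `eps` inside the logarithms is an assumption added for numerical safety and is not part of the equations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recons_loss(x, x_rec, y):
    """Eq. (4): -cos for normal samples (y=0), +cos for spam samples (y=1)."""
    c = cosine(x, x_rec)
    return y * c - (1 - y) * c


def discri_loss(y, y_hat, eps=1e-12):
    """Eq. (5): binary cross entropy; eps is an added safety constant."""
    return -y * math.log(y_hat + eps) - (1 - y) * math.log(1 - y_hat + eps)


def objective(samples):
    """Eq. (6): mean of the two losses over (x, x_rec, y, y_hat) tuples."""
    return sum(recons_loss(x, x_rec, y) + discri_loss(y, y_hat)
               for x, x_rec, y, y_hat in samples) / len(samples)


# Made-up batch: a well-reconstructed normal sample and a poorly
# reconstructed spam sample, both classified with high confidence.
batch = [([1.0, 2.0], [1.0, 2.0], 0, 0.05),
         ([1.0, 0.0], [0.0, 1.0], 1, 0.95)]
print(objective(batch))
```

For a normal sample the reconstruction term is minimized at $-1$ (perfect reconstruction), while for a spam sample it decreases as the reconstruction moves away from the input, which is the behavior the text describes.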

The pseudo-code of the training process is shown in Algorithm 1, and all the symbols mentioned in this section are summarized in Table 3.
Algorithm 1 Model training with cosine reconstruction loss
Input: training set X and target labels Y
Output: ABC model
1: while not converged do
2:   for each sample ($x_i, y_i$) in training set do
3:     construct local features with convolution kernels {$conv_1, … , conv_k$}
4:     concatenate raw and local features
5:     generate reconstructed representation $x_{i'}$
6:     calculate cosine similarity($x_i, x_{i'}$)
7:     predict the label $\hat y_i$
8:     calculate reconstruction and discrimination losses as in Equations (4) and (5)
9:     optimize the model according to the objective function in Equation (6)
10:   end for
11: end while

Table 3. Summary of symbols

| Symbol | Value type | Description |
| --- | --- | --- |
| $X, Y$ | matrix | The sample set, and the corresponding target labels |
| $m, n$ | scalar | The dimension of raw features, and the number of samples |
| $x_i, y_i$ | vector | A record and its label from sample set |
| $conv_i$ | function | A convolution kernel for local feature extraction |
| $l_i$ | scalar | The size of i-th convolution kernel |
| $k$ | scalar | The number of convolution kernels |
| $x_{i'}$ | vector | Reconstructed representation from the auto-encoder |
| $cos$ | function | A function to measure the similarity of two vectors |
| $\hat y_i$ | vector | The prediction result of model |

Performance Analysis

The Honeypot and 6 Million datasets are employed to evaluate the proposed method. All the code is written in Python 3.6, and PyTorch 1.4.0 is utilized as the deep learning framework. All machine learning algorithms are implemented with scikit-learn 0.21.3. The experiments are conducted on Ubuntu 18.04 with an Intel Core i9-9900K CPU @ 3.6 GHz ×16, and a GeForce RTX 2080 GPU is utilized to accelerate the training of neural networks.
To compare different spam detection algorithms, we utilize accuracy, precision, recall, F1-score, and AUC as the metrics. For binary classification problems, the classification results are summarized by four counts. True positive (TP) is the number of correctly labeled positive samples, and true negative (TN) is the number of correctly labeled negative samples. False positive (FP) and false negative (FN) are the numbers of negative samples incorrectly labeled as positive and positive samples incorrectly labeled as negative, respectively. These counts can be combined into other metrics according to Equations (7)–(11).

$Accuracy= \frac{TP+TN}{TP+TN+FP+FN}$ (7)

$Precision= \frac{TP}{TP+FP}$ (8)

$Recall/TPR= \frac{TP}{TP+FN}$ (9)

$F1= \frac{2\times Precision\times Recall}{Precision+Recall}$ (10)

$FPR= \frac{FP}{FP+TN}$ (11)
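As a quick worked example of these formulas, the sketch below derives the five quantities from a confusion matrix. The counts are made up for illustration and do not come from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall (TPR), F1, and FPR from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Made-up counts: 100 spam accounts (positives), 100 legitimate accounts.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85, the recall is 0.80, and the FPR is 0.10. The F1-score (about 0.842) lies between the precision (about 0.889) and the recall, as expected for a harmonic mean.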

The receiver operating characteristic (ROC) curve measures how the classification performance changes with different thresholds. It takes the positive-class probabilities generated by a classifier and plots the true positive rate (TPR) against the false positive rate (FPR). The area under the ROC curve is denoted as AUC, which is exploited as a comprehensive metric in classification tasks.
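A minimal sketch of this procedure is given below: the threshold is swept over the predicted scores, one (FPR, TPR) point is recorded per distinct score, and the area is integrated with the trapezoidal rule. The labels and scores are made up for illustration, and the helper names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all samples sharing this score so ties form a single point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]                 # 1 = spam, 0 = legitimate
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # made-up spam probabilities
area = auc(roc_points(labels, scores))
print(round(area, 4))
```

In this toy case 8 of the 9 (spam, legitimate) pairs are ranked correctly, which gives an AUC of 8/9.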
The structural details of our proposed model are shown in Table 4, which uses the 5k-random dataset as an example. An FC module is a concatenation of a fully connected layer, a batch normalization layer, a ReLU activation layer, and a dropout layer. The numbers of input neurons and convolution kernels are adjusted for different datasets.

Table 4. Model structure details of ABC for 5k-random dataset

| Layer | Setting | # parameters | Notes |
| --- | --- | --- | --- |
| Conv1d*5 | kernel size = 2 | 3×5 | Feature extraction |
| Conv1d*5 | kernel size = 3 | 4×5 | Feature extraction |
| FC module | output units = 128 | 15104+256 | Encoder |
| FC module | output units = 64 | 8256+128 | Encoder |
| FC module | output units = 10 | 650+20 | Encoder |
| FC module | output units = 64 | 704+128 | Decoder |
| FC module | output units = 128 | 8320+256 | Decoder |
| Linear | output units = 12 | 1548 | Decoder |
| Sigmoid | activation function | 0 | Decoder |
| Bilinear | output units = 128 | 18560 | Bilinear mapping |
| FC module | output units = 64 | 8256+128 | Forward network |
| Linear | output units = 1 | 65 | Forward network |
| Sigmoid | activation function | 0 | Forward network |
| Total | - | 62414 | - |
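The parameter counts in Table 4 can be reproduced with a short calculation. The sketch assumes 12 raw input features (inferred from the 12-unit reconstruction layer), unpadded convolutions, and the usual per-layer formulas: weights plus biases for linear and bilinear layers, and two parameters per unit for batch normalization (so 128 for a 64-unit layer). The helper names are hypothetical.

```python
def conv1d_params(kernel_size, n_kernels):
    """Single-channel 1-D kernels: kernel_size weights + 1 bias each."""
    return (kernel_size + 1) * n_kernels


def linear_params(n_in, n_out):
    """Fully connected layer: weight matrix + bias vector."""
    return n_in * n_out + n_out


def batchnorm_params(n_units):
    """Batch normalization: one scale and one shift per unit."""
    return 2 * n_units


def bilinear_params(n_in1, n_in2, n_out):
    """Bilinear layer: one n_in1 x n_in2 matrix per output unit + bias."""
    return n_in1 * n_in2 * n_out + n_out


m = 12                                   # assumed number of raw features
conv_cfg = [(2, 5), (3, 5)]              # (kernel size, number of kernels)

conv_total = sum(conv1d_params(s, k) for s, k in conv_cfg)
ae_input = m + sum((m - s + 1) * k for s, k in conv_cfg)   # raw + local

widths = [ae_input, 128, 64, 10, 64, 128]                  # FC modules
fc_total = sum(linear_params(a, b) + batchnorm_params(b)
               for a, b in zip(widths, widths[1:]))
recon_total = linear_params(128, m)                        # back to m features
bilinear_total = bilinear_params(m, m, 128)
head_total = (linear_params(128, 64) + batchnorm_params(64)
              + linear_params(64, 1))

total = conv_total + fc_total + recon_total + bilinear_total + head_total
print(ae_input, total)
```

Under these assumptions the auto-encoder input has 117 dimensions and the overall count matches the listed total of 62414.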

Experiments on Honeypot Dataset
First, the detection methods are evaluated on the Social Honeypot dataset. Five original features and four extended features are extracted for spam detection, as shown in Fig. 2. There are 22,223 spammers in Honeypot, but the user IDs of some legitimate records are repeated. After removing duplicate records, 19,232 legitimate users remain in the dataset.
As summarized in Table 2, current studies on spam detection are conducted on different datasets and feature sets. Because the original source code and datasets of these methods are difficult to access, we compare the proposed model (ABC) with the basic models used in recent studies, with minor modifications to fit our datasets. The scripts used in this study are publicly available on GitHub (https://github.com/zhangsunny/Twitter-Spam-Detection).
The Adam optimizer is used for parameter updating with a learning rate of 0.01. The number of training steps is fixed at 500 and the batch size is 1024. The number of neurons in the middle layer of the auto-encoder is 10. Other hyper-parameters of ABC are shown in Table 5. As introduced in Section 3, multiple convolution kernels are employed to extract local features. Small kernels are preferred, because several small convolution kernels can replace a larger one and significantly reduce the number of parameters. The stride length and channel number are fixed to 1, and the kernel size is adjusted for different datasets. In the kernel size column, [a]×b denotes b convolution kernels of size a. A linear kernel is used in SVM, and its other parameters keep their default settings. The parameters of the other classifiers from the scikit-learn module also follow the default settings.
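
To illustrate the [a]×b notation, the following sketch slides b single-channel kernels of size a over a 12-feature record with stride 1 and appends the outputs to the original features. It is a pure-Python stand-in for a learned Conv1d layer: the kernel weights are random placeholders, no activation is applied, and the function names are hypothetical.

```python
import random


def conv1d(features, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) with stride 1 and one channel."""
    size = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, features[i:i + size])) + bias
        for i in range(len(features) - size + 1)
    ]


def extract_local_features(features, kernel_spec, rng):
    """kernel_spec maps kernel size a to kernel count b, i.e. [a] x b."""
    local = []
    for size, count in kernel_spec.items():
        for _ in range(count):
            # Random weights stand in for weights that would be learned in training.
            kernel = [rng.uniform(-1.0, 1.0) for _ in range(size)]
            local.extend(conv1d(features, kernel))
    return local


rng = random.Random(0)
record = [rng.random() for _ in range(12)]  # hypothetical scaled record
local = extract_local_features(record, {2: 5, 3: 5}, rng)  # [2]x5 and [3]x5
combined = record + local  # global features followed by local features
print(len(local), len(combined))
```

Each kernel of size a produces 12 − a + 1 outputs, so the [2]×5 and [3]×5 setting yields 55 + 50 local features in this sketch.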
To evaluate the extended features, all the models are trained on both the original and the extended datasets. Each dataset is split into a training set and a testing set in the ratio of 6:4. The detailed classification results are shown in Tables 6 and 7.

Table 6. Classification performances on original Honeypot dataset
| Basic model | Accuracy | Precision | Recall | F1-score | AUC | Execution time (s) |
|---|---|---|---|---|---|---|
| ABC | 88.45 | 88.23 | 90.53 | 89.37 | 94.61 | 283.27 |
| KNN | 85.41 | 86.74 | 86.93 | 86.33 | 90.92 | 0.75 |
| DT [16] | 85.57 | 86.31 | 86.87 | 86.59 | 85.49 | 0.07 |
| NB [30] | 53.65 | 53.63 | 99.93 | 69.8 | 76.52 | 0.02 |
| SVM [31] | 70.14 | 72.76 | 73.48 | 73.12 | 78.71 | 52.9 |
| CNN [33] | 88.37 | 88.7 | 89.74 | 89.22 | 94.52 | 157.54 |

Table 7. Classification performances on extended Honeypot dataset
| Basic model | Accuracy | Precision | Recall | F1-score | AUC (gain over original) | Execution time (s) |
|---|---|---|---|---|---|---|
| ABC | 96.09 | 96.06 | 96.67 | 96.37 | 99.58 (5.25%↑) | 294.52 |
| KNN | 84.3 | 87.61 | 82.36 | 84.91 | 91.04 (0.13%↑) | 1.13 |
| DT [16] | 94.2 | 94.95 | 94.18 | 94.57 | 94.20 (10.19%↑) | 0.11 |
| NB [30] | 69.3 | 64.05 | 97.41 | 77.28 | 88.17 (15.22%↑) | 0.03 |
| SVM [31] | 85.91 | 87.57 | 85.9 | 86.73 | 92.45 (17.46%↑) | 38.98 |
| CNN [33] | 92.73 | 97.64 | 88.58 | 92.89 | 98.60 (4.32%↑) | 159.35 |

According to the experimental results on the original and extended Honeypot datasets, ABC achieves the best accuracy, F1-score, and AUC. Notably, every classifier obtains a higher AUC on the extended dataset, and SVM shows the largest gain (17.46%). The kernel function in SVM projects samples into a high-dimensional space to make them linearly separable, and the essence of a decision tree is the partition of the feature subspace. The extended features provide more discriminative information from different views to help identify spam data. However, the accuracy and F1-score of KNN are lower on the extended dataset. KNN uses Euclidean distance to measure the difference between samples, which weights all features equally and does not reflect their actual importance. The results of KNN also illustrate the importance of further representation learning on the basis of hand-crafted features.
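
The percentages in the AUC column of Table 7 are consistent with the relative change from the AUC values in Table 6. The short check below reproduces them from the tabulated numbers; the function name `relative_gain` is an illustrative choice.

```python
# AUC values copied from Table 6 (original) and Table 7 (extended).
auc_original = {"ABC": 94.61, "KNN": 90.92, "DT": 85.49,
                "NB": 76.52, "SVM": 78.71, "CNN": 94.52}
auc_extended = {"ABC": 99.58, "KNN": 91.04, "DT": 94.20,
                "NB": 88.17, "SVM": 92.45, "CNN": 98.60}


def relative_gain(before, after):
    """Relative change in percent: (after - before) / before * 100."""
    return (after - before) / before * 100


gains = {
    model: round(relative_gain(auc_original[model], auc_extended[model]), 2)
    for model in auc_original
}
largest = max(gains, key=gains.get)
print(gains, largest)
```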

Experiments on 6 Million Dataset
The second dataset, denoted as 6 Million, is also used for evaluation and analysis. The comparative experiments are conducted on two sub-datasets selected from 6 Million, named 5k-continuous and 5k-random. Each sub-dataset contains 10,000 records, including 5,000 spam samples. The former is collected as a continuous segment of the whole dataset, while the latter is randomly sampled. Each record has 12 features. Since the 6 Million dataset was mainly collected for detecting spam tweets, user-based and content-based features are combined: the first six features describe basic information about the Twitter account, and the last six features are extracted from the content of each tweet.

Table 8. Performances on 5k-continuous dataset
| Basic model | Accuracy | Precision | Recall | F1-score | AUC | Execution time (s) |
|---|---|---|---|---|---|---|
| ABC | 89.35 | 91.42 | 86.85 | 89.08 | 95.68 | 80.11 |
| KNN | 85.82 | 86.76 | 84.55 | 85.64 | 91.95 | 0.29 |
| MRF [10] | 87.92 | 90.41 | 84.85 | 87.54 | 94.75 | 326.18 |
| DT [16] | 92.15 | 92.36 | 91.9 | 92.13 | 92.15 | 0.02 |
| NB [30] | 73.7 | 69.59 | 84.2 | 76.2 | 78.13 | 0.01 |
| SVM [31] | 76.08 | 71.72 | 86.1 | 78.25 | 81.87 | 3.41 |
| CNN [33] | 84.27 | 82.44 | 87.1 | 84.71 | 93.08 | 39.33 |

Table 9. Performances on 5k-random dataset
| Basic model | Accuracy | Precision | Recall | F1-score | AUC | Execution time (s) |
|---|---|---|---|---|---|---|
| ABC | 79.15 | 79.3 | 78.9 | 79.1 | 87.61 | 80.17 |
| KNN | 73.68 | 72.64 | 75.95 | 74.26 | 80.8 | 0.38 |
| MRF [10] | 76.72 | 75.61 | 78.9 | 77.22 | 85.02 | 248.01 |
| DT [16] | 78.15 | 77.73 | 78.9 | 78.31 | 78.15 | 0.03 |
| NB [30] | 63.75 | 65.01 | 59.55 | 62.16 | 69.25 | 0.01 |
| SVM [31] | 65.5 | 62.76 | 76.25 | 68.85 | 69.59 | 4.39 |
| CNN [33] | 77 | 74.06 | 83.1 | 78.32 | 83.86 | 38.06 |

All the features are scaled into [0, 1] by min-max normalization for better decoding. All four datasets are split into training and testing sets in the ratio of 6:4. The experimental results are shown in Tables 8 and 9. Every model has lower accuracy and AUC on the 5k-random dataset, which indicates that it is more difficult for binary classification than the 5k-continuous dataset. For the 5k-continuous dataset, decision tree has the highest F1-score, while ABC has the highest AUC. For the 5k-random dataset, our proposed method achieves the best accuracy, precision, F1-score, and AUC, although CNN has a higher recall. Considering the classification difficulty of the 5k-random dataset, ABC appears more suitable for complex modeling problems.
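
The preprocessing described above can be sketched as follows. This is an assumption-laden illustration rather than the released scripts: the text does not state whether the minimum and maximum are computed on the full data or only on the training portion (this sketch uses all rows), and the function names, seed, and toy data are hypothetical.

```python
import random


def min_max_scale(rows):
    """Scale every feature column into [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]


def split_train_test(rows, labels, train_ratio=0.6, seed=0):
    """Shuffle the indices and split them in the given ratio (6:4 by default)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_ratio * len(indices))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


# Toy data: 10 records with 2 features and alternating labels.
toy_rows = [[i, 2 * i] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
scaled = min_max_scale(toy_rows)
x_train, y_train, x_test, y_test = split_train_test(scaled, toy_labels)
print(len(x_train), len(x_test))
```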

Reconstruction Visualization
In our proposed model, the auto-encoder is employed to model the distribution of the normal pattern and to generate a reconstructed representation for the subsequent classification task. In addition, the cosine reconstruction loss is proposed to increase the similarity between the original and reconstructed representations for normal samples, while enlarging the difference for spam samples. Therefore, the reconstruction error of spam samples is high enough to be used as an indicator for spam detection.
To evaluate the effect of the auto-encoder and the cosine reconstruction loss, 1,000 samples with normal and spam labels are selected from the 5k-random dataset and fed into the auto-encoder to collect their reconstructed vector representations. The t-distributed stochastic neighbor embedding (TSNE) [36], a well-known manifold learning tool, is applied to the reconstructed representations to embed them into a two-dimensional space. As shown in Fig. 5, the normal and spam distributions are clearly separated after learning the reconstructed representations, whereas the two categories of samples without reconstruction are mixed and unstructured. This observation indicates that the auto-encoder, optimized with the cosine reconstruction loss, increases the distance between the distributions of the two categories. The corresponding classification results in Tables 8 and 9 also support the effectiveness of the auto-encoder and the cosine reconstruction loss.
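
The idea behind the cosine reconstruction loss can be illustrated with a small sketch. The exact formulation is defined earlier in the paper and is not reproduced here; the form below is only an assumed, simplified variant that rewards high cosine similarity for normal samples and penalizes it for spam samples. All names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)


def cosine_reconstruction_loss(original, reconstructed, is_spam):
    """Assumed illustrative loss, not the paper's exact equation."""
    similarity = cosine_similarity(original, reconstructed)
    if is_spam:
        # Penalize similarity so spam reconstructions drift away from their inputs.
        return max(0.0, similarity)
    # Penalize dissimilarity so normal reconstructions stay close to their inputs.
    return 1.0 - similarity


sample = [0.2, 0.8, 0.5]
close_reconstruction = [0.21, 0.79, 0.5]
far_reconstruction = [0.9, 0.1, 0.2]

normal_close = cosine_reconstruction_loss(sample, close_reconstruction, is_spam=False)
normal_far = cosine_reconstruction_loss(sample, far_reconstruction, is_spam=False)
spam_close = cosine_reconstruction_loss(sample, close_reconstruction, is_spam=True)
spam_far = cosine_reconstruction_loss(sample, far_reconstruction, is_spam=True)
print(normal_close < normal_far, spam_far < spam_close)
```

Under this assumed form, a faithful reconstruction gives a low loss for a normal sample and a high loss for a spam sample, which is the behavior that makes the reconstruction error usable as a spam indicator.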

Fig. 5. Visualization for (a) reconstructed and (b) original data distributions by TSNE.

Ablation Study
As described above, the bilinear layer is employed to model pairwise interactions and capture the difference between the reconstructed and original representations. Ablation experiments are performed to analyze the necessity of the bilinear layer by replacing it with a simple concatenation operation (denoted as concat). We also evaluate the impact of global and local features through different combinations, as shown in Table 10.
Compared with concatenation, the bilinear mapping improves the performance whether the local, global, or combined features are employed. The increase in F1-score is especially significant when only local features are used, rising from 74.99 to 86.67 on the original Honeypot dataset. It is also observed that global+local+concat generally has worse results than global+concat and global+bilinear. Although the local features are provided, concatenation cannot make full use of the supplementary information, and the increase in feature dimensions may even reduce the performance of the classifier. When the bilinear layer is used instead, an improvement is observed, which illustrates the positive impact and necessity of the bilinear mapping.
In addition, the empirical results show that the global features carry more clues about the task than the local features when the two types of features are employed separately. The local features do provide additional information different from that carried by the global features, but the enlarged feature set requires a more suitable architecture for refinement, such as the bilinear layer. This observation is consistent with our motivation, i.e., extracting the local features as a supplement to the global features.
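
The difference between the two fusion operations compared in the ablation can be sketched as follows. Concatenation simply stacks the original and reconstructed vectors, whereas a bilinear mapping computes, for each output unit k, a weighted sum over all pairs of entries, z_k = Σ_ij x_i W_kij x̂_j + b_k. The sketch assumes 12-dimensional inputs and 128 output units, which agrees with the 18560 parameters listed for the bilinear layer in Table 4; the random weights and function names are placeholders.

```python
import random


def concat(x, x_hat):
    """Baseline fusion: stack both vectors without modeling interactions."""
    return list(x) + list(x_hat)


def bilinear(x, x_hat, weight, bias):
    """z_k = sum_ij x[i] * weight[k][i][j] * x_hat[j] + bias[k]."""
    return [
        sum(x[i] * w_k[i][j] * x_hat[j]
            for i in range(len(x)) for j in range(len(x_hat))) + b_k
        for w_k, b_k in zip(weight, bias)
    ]


rng = random.Random(0)
n_features, n_out = 12, 128
x = [rng.random() for _ in range(n_features)]      # original representation
x_hat = [rng.random() for _ in range(n_features)]  # reconstructed representation
# Random weights stand in for parameters that would be learned in training.
weight = [[[rng.gauss(0.0, 0.1) for _ in range(n_features)]
           for _ in range(n_features)] for _ in range(n_out)]
bias = [0.0] * n_out

fused_concat = concat(x, x_hat)
fused_bilinear = bilinear(x, x_hat, weight, bias)
n_params = n_out * n_features * n_features + n_out
print(len(fused_concat), len(fused_bilinear), n_params)
```

Every bilinear output depends on products of one original entry and one reconstructed entry, so mismatches between the two vectors are exposed explicitly instead of being left for later layers to discover.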

Table 10. Ablation experiments on four datasets
| Method | Original Honeypot F1-score | Original Honeypot AUC | Extended Honeypot F1-score | Extended Honeypot AUC | 5k-random F1-score | 5k-random AUC | 5k-continuous F1-score | 5k-continuous AUC |
|---|---|---|---|---|---|---|---|---|
| local+concat | 74.99 | 91.48 | 89.45 | 98.27 | 72.29 | 78.81 | 84.96 | 94.55 |
| local+bilinear | 86.67 | 94.2 | 89.81 | 98.72 | 77.44 | 85.11 | 88.02 | 95.05 |
| global+concat | 88.55 | 94.26 | 95.01 | 99.01 | 77.04 | 85.42 | 88.28 | 94.91 |
| global+bilinear | 88.68 | 94.59 | 96.05 | 99.24 | 79.04 | 86.28 | 88.37 | 95.43 |
| global+local+concat | 88.11 | 94.09 | 90.98 | 98.36 | 77.57 | 84.07 | 78.7 | 88.21 |
| global+local+bilinear | 89.37 | 94.61 | 96.37 | 99.58 | 79.1 | 87.61 | 89.08 | 95.68 |

Parameter Analysis
The auto-encoder compresses information: the input features are refined to retain task-relevant information while discarding noise. The number of middle neurons has a direct impact on this process. Too few middle neurons may cause the loss of necessary information, while too many may keep irrelevant information in the intermediate representation, which would hamper the classifier. Convolution kernels are employed for local feature extraction in our study, and the number of kernels also has a vital impact on the model performance. Therefore, we have conducted experiments to evaluate the impact of these two parameters on execution time and AUC, as shown in Fig. 6.

Fig. 6. Parameter analysis for (a) convolutional kernels and (b) the number of middle neurons.

Fig. 6(a) illustrates the impact of the number of kernels. The execution time is positively and linearly correlated with the number of kernels. Although convolution is a time-consuming operation, it has a parallel implementation in the PyTorch framework, which accelerates this process. According to Fig. 6(b), there is no direct relationship between the number of middle neurons and execution time. Execution time usually has a positive correlation with the number of parameters, and the increase in middle neurons is insignificant compared with the overall number of model parameters.
For the 5k-random dataset, the AUC value is sensitive to the setting of the two parameters, while it is more stable for the 5k-continuous dataset. Considering the classification difficulty of the two datasets, we conclude that the two parameters should be tuned carefully for difficult datasets. The number of kernels and neurons should be kept as small as possible while maintaining the classification performance.

Discussion
Through the qualitative and quantitative experiments in this section, the performance of our proposed model has been evaluated. Compared with existing methods based on different models, the following conclusions can be drawn from the empirical results:
(1) When well-designed hand-crafted features are available, classic machine learning methods can achieve performance competitive with deep learning methods. Under the same conditions, machine learning methods usually require less execution time, while deep learning methods require multi-round iterative optimization.
(2) Deep learning methods can simplify the process of feature extraction, training, and inference, but they depend heavily on the quality and quantity of training samples. Under ideal conditions, deep learning methods have a stronger ability for representation learning and pattern recognition, which makes them more suitable for complex modeling problems.
(3) Although automatic representation learning has become dominant in the community, feature engineering still plays an important role in spam detection. Hand-crafted features and automatic representation learning can jointly improve the performance with appropriate architectures.

Conclusion

In this work, we propose the auto-encoding bilinear classifier to detect Twitter spam based on the reconstruction error. First, multiple convolution kernels are employed to extract local features, which provide supplementary information from different views. Then, the local and global features are fed into the auto-encoder to learn the reconstructed representation. The auto-encoder is expected to capture the patterns of normal data and to correctly reconstruct the original input sample if it matches the normal pattern. Consequently, the difference between the original and reconstructed representations is an indicator for spam detection. The bilinear mapping is used to capture the pairwise difference between the reconstructed and original representations. A feed-forward network, composed of stacked fully connected layers, is employed to identify spam data according to the evidence from the bilinear mapping. Empirical results on the Social Honeypot and 6 Million Spam Tweets datasets demonstrate that our proposed model achieves better performance on several classification metrics.
However, our method relies on supervised information from labels. Massive amounts of unlabeled data exist in social networks, and the manual tagging process is time-consuming and labor-intensive. In future work, we plan to introduce a deep generative model and explore unsupervised methods based on graph theory for social network security.

Author’s Contributions

Conceptualization, QH, CY. Funding acquisition, CY. Investigation and methodology, SZ, CY. Writing of the original draft, QH, SZ, BL. Writing of the review and editing, QH, SZ, BL.

Funding

This work was funded by the National Natural Science Foundation of China (No. 61772282). It was also supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX21 1008).

Competing Interests

The authors declare that they have no competing interests.

Author Biography

Name : Qian He
Affiliation : College of Information Science and Electronic Engineering, Hunan City University
Biography : Qian He received his Ph.D. degree in Applied Mathematics from Hunan University, China in 2016. His research interests are in the areas of medical image reconstruction, network security, image processing, and difference equations.

Name : Sun Zhang
Affiliation : School of Computer and Software, Nanjing University of Information Science and Technology
Biography : Sun Zhang received the B.S. and M.S. degrees in computer science and technology from Nanjing University of Information Science and Technology, China, in 2015 and 2018, respectively. He is now pursuing his Ph.D. degree there. His current research interests include deep learning, social network security, and sentiment analysis.

Name : Bo Li
Affiliation : School of Computer and Software, Nanjing University of Information Science and Technology
Biography : Bo Li received his bachelor's degree from Jining Medical University, China. He is now studying for his master's degree at Nanjing University of Information Science and Technology, China. His current research interests are machine learning and network security.

Name : Chunyong Yin
Affiliation : School of Computer and Software, Nanjing University of Information Science and Technology
Biography : Chunyong Yin received the B.S. degree from Shandong University of Technology, China in 1998. He received the M.S. and Ph.D. degrees in computer science from Guizhou University, China, in 2005 and 2008, respectively. He was a post-doctoral research associate at the University of New Brunswick, Canada, in 2011 and 2012. He is currently a Professor and Dean with the Nanjing University of Information Science and Technology, China. He has authored or coauthored more than twenty journal and conference papers. His current research interests include privacy preserving and sensor networking, machine learning and network security.

References

[1] G. Lingam, R. R. Rout, D. V. L. N. Somayajulu, and S. K. Ghosh, “Particle swarm optimization on deep reinforcement learning for detecting social spam bots and spam-influential users in twitter network,” IEEE Systems Journal, vol. 15, no. 2, pp. 2281-2292, 2021.
[2] M. Alazab and R. Broadhurst, “Spam and criminal activity,” Trends and Issues in Crime and Criminal Justice, vol. 2016, no. 526, pp. 1-20, 2016.
[3] A. Jamil, K. Asif, Z. Ghulam, M. K. Nazir, S. M. Alam, and R. Ashraf, “MPMPA: a mitigation and prevention model for social engineering based phishing attacks on Facebook,” in Proceedings of 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, 2018, pp. 5040-5048.
[4] R. Zafarani and H. Liu, “10 bits of surprise: Detecting malicious users with minimum information,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 2015, pp. 423-431.
[5] C. Chen, J. Zhang, X. Chen, Y. Xiang, and W. Zhou, “6 million spam tweets: a large ground truth for timely Twitter spam detection,” in Proceedings of 2015 IEEE International Conference on Communications (ICC), London, UK, 2015, pp. 7065-7070.
[6] M. Fazil and M. Abulaish, “A hybrid approach for detecting automated spammers in Twitter,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2707-2719, 2018.
[7] D. Seyler, S. Tan, D. Li, J. Zhang, and P. Li, “Textual analysis and timely detection of suspended social media accounts,” in Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), Atlanta, GA, 2021, pp. 644-655.
[8] J. Cheng, M. Bernstein, C. Danescu-Niculescu-Mizil, and J. Leskovec, “Anyone can become a troll: causes of trolling behavior in online discussions,” in Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, Portland, OR, 2017, pp. 1217-1230.
[9] M. Washha, A. Qaroush, M. Mezghani, and F. Sedes, “Unsupervised collective-based framework for dynamic retraining of supervised real-time spam tweets detection model,” Expert Systems with Applications, vol. 135, pp. 129-152, 2019.
[10] N. El-Mawass, P. Honeine, and L. Vercouter, “SimilCatch: enhanced social spammers detection on Twitter using Markov random fields,” Information Processing & Management, vol. 57, no. 6, article no. 102317, 2020. https://doi.org/10.1016/j.ipm.2020.102317
[11] J. C. S. Sicato, S. K. Singh, S. Rathore, and J. H. Park, “A comprehensive analyses of intrusion detection system for IoT environment,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 975-990, 2020.
[12] C. Yin, S. Zhang, J. Wang, and N. N. Xiong, “Anomaly detection based on convolutional recurrent autoencoder for IoT time series,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 1, pp. 112-122, 2022.
[13] M. Z. Asghar, F. Subhan, H. Ahmad, W. Z. Khan, S. Hakak, T. R. Gadekallu, and M. Alazab, “Senti‐eSystem: a sentiment‐based eSystem‐using hybridized fuzzy and deep neural network for measuring customer satisfaction,” Software: Practice and Experience, vol. 51, no. 3, pp. 571-594, 2021.
[14] J. D. Lee, H. S. Cha, S. Rathore, and J. H. Park, “M-IDM: a multi-classification based intrusion detection model in healthcare IoT,” Computers, Materials and Continua, vol. 67, no. 2, pp. 1537-1553, 2021.
[15] S. Rathore and J. H. Park, “A blockchain-based deep learning approach for cyber security in next generation industrial cyber-physical systems,” IEEE Transactions on Industrial Informatics, vol. 17, no. 8, pp. 5522-5532, 2020.
[16] Y. Tan, Q. Wang, and G. Mi, “Ensemble decision for spam detection using term space partition approach,” IEEE Transactions on Cybernetics, vol. 50, no. 1, pp. 297-309, 2020.
[17] K. Lee, J. Caverlee, and S. Webb, “Uncovering social spammers: social honeypots + machine learning,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, 2010, pp. 435-442.
[18] F. Masood, A. Almogren, A. Abbas, H. A. Khattak, I. U. Din, M. Guizani, and M. Zuair, “Spammer detection and fake user identification on social networks,” IEEE Access, vol. 7, pp. 68140-68152, 2019.
[19] K. Shu, S. Dumais, A. H. Awadallah, and H. Liu, “Detecting fake news with weak social supervision,” IEEE Intelligent Systems, vol. 36, no. 4, pp. 96-103, 2021.
[20] S. Hakak, M. Alazab, S. Khan, T. R. Gadekallu, P. K. R. Maddikunta, and W. Z. Khan, “An ensemble machine learning approach through effective feature extraction to classify fake news,” Future Generation Computer Systems, vol. 117, pp. 47-58, 2021.
[21] A. Makkar and N. Kumar, “An efficient deep learning-based scheme for web spam detection in IoT environment,” Future Generation Computer Systems, vol. 108, pp. 467-487, 2020.
[22] X. Tang, T. Qian, and Z. You, “Generating behavior features for cold-start spam review detection with adversarial learning,” Information Sciences, vol. 526, pp. 274-288, 2020.
[23] P. K. Roy, J. P. Singh, and S. Banerjee, “Deep learning to filter SMS spam,” Future Generation Computer Systems, vol. 102, pp. 524-533, 2020.
[24] Y. Liu, B. Pang, and X. Wang, “Opinion spam detection by incorporating multimodal embedded representation into a probabilistic review graph,” Neurocomputing, vol. 366, pp. 276-283, 2019.
[25] J. H. Park, M. M. Salim, J. H. Jo, J. C. S. Sicato, S. Rathore, and J. H. Park, “CIoT-Net: a scalable cognitive IoT based smart city network architecture,” Human-centric Computing and Information Sciences, vol. 9, article no. 29, 2019. https://doi.org/10.1186/s13673-019-0190-9
[26] P. K. Sharma, J. H. Park, Y. S. Jeong, and J. H. Park, “SHSec: SDN based secure smart home network architecture for Internet of Things,” Mobile Networks and Applications, vol. 24, no. 3, pp. 913-924, 2019.
[27] J. C. S. Sicato, P. K. Sharma, V. Loia, and J. H. Park, “VPNFilter malware analysis on cyber threat in smart home network,” Applied Sciences, vol. 9, no. 13, article no. 2763, 2019. https://doi.org/10.3390/app9132763
[28] S. Madisetty and M. S. Desarkar, “A neural network-based ensemble approach for spam detection in Twitter,” IEEE Transactions on Computational Social Systems, vol. 5, no. 4, pp. 973-984, 2018.
[29] I. Inuwa-Dutse, M. Liptrott, and I. Korkontzelos, “Detection of spam-posting accounts on Twitter,” Neurocomputing, vol. 315, pp. 496-511, 2018.
[30] A. T. Kabakus and R. Kara, “‘TwitterSpamDetector’: a spam detection framework for Twitter,” International Journal of Knowledge and Systems Science, vol. 10, no. 3, pp. 1-14, 2019.
[31] S. B. S. Ahmad, M. Rafie, and S. M. Ghorabie, “Spam detection on Twitter using a support vector machine and users’ features by identifying their interactions,” Multimedia Tools and Applications, vol. 80, no. 8, pp. 11583-11605, 2021.
[32] S. B. Abkenar, E. Mahdipour, S. M. Jameii, and M. HaghiKashani, “A hybrid classification method for Twitter spam detection based on differential evolution and random forest,” Concurrency and Computation: Practice and Experience, vol. 33, no. 21, article no. e6381, 2021. https://doi.org/10.1002/cpe.6381
[33] K. Kiruthika Devi and G. A. Sathish Kumar, “Stochastic gradient boosting model for Twitter spam detection,” Computer Systems Science and Engineering, vol. 41, no. 2, pp. 849-859, 2022.
[34] Y. Liu, L. Wang, T. Shi, and J. Li, “Detection of spam reviews through a hierarchical attention architecture with N-gram CNN and Bi-LSTM,” Information Systems, vol. 103, article no. 101865, 2022. https://doi.org/10.1016/j.is.2021.101865
[35] X. Zheng, J. Han, and A. Sun, “A survey of location prediction on Twitter,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 9, pp. 1652-1671, 2018.
[36] A. Lensen, B. Xue, and M. Zhang, “Genetic programming for evolving a front of interpretable models for data visualization,” IEEE Transactions on Cybernetics, vol. 51, no. 11, pp. 5468-5482, 2020.
