Preserving the Privacy of Dependent Tuples Using Enhanced Differential Privacy
  • Saqib Ali1,2,*, Guojun Wang1,*, Shazia Riaz2,3, and Tayyaba Rafique2

Human-centric Computing and Information Sciences volume 12, Article number: 43 (2022)
https://doi.org/10.22967/HCIS.2022.12.043

Abstract

Nowadays, social media applications have become a prominent source of users' data containing sensitive personal information, and hence a possible source of violations of an individual's privacy. Differential privacy (DP) is a globally recognized mathematical tool for dealing with these privacy concerns but fails to adequately preserve the privacy of dependent tuples. Various solutions have been proposed to address this issue by investigating different privacy parameters. Still, some of these privacy parameters make assumptions about the adversary's background knowledge or provide an inefficient solution regarding their accuracy and utility. To resolve these shortcomings, the proposed enhanced DP-based solution uses the maximal information coefficient (MIC) to calculate the dependencies between the tuples and conceals users' information by incorporating noise according to the calculated degree of correlation. The performance of the enhanced DP is evaluated on real-world datasets. Furthermore, an inference attack is generated on the query results to validate the effectiveness of the proposed solution. It provides rigorous privacy guarantees by leaking less information than the standard DP in the presence of dependent tuples. Compared with other mechanisms, it ensures better performance and data utility with minimum privacy cost.


Keywords

Differential Privacy, Dependent Tuples, Correlation, Maximal Information Coefficient, Inference Attack


Introduction

Different organizations and online social networks (OSNs) such as Facebook, Instagram, LinkedIn, etc., collect a massive amount of users' data shared on their sites. These companies use this data to monitor the platform's performance, discover market trends, understand customer preferences, deliver targeted ads, build recommendation systems, etc. Furthermore, these OSNs are significantly transforming the way individuals interact, becoming a predominant means of associating, cooperating, communicating, and sharing data online. With such rapid growth of OSNs, sharing information is gradually becoming simpler. However, data sharing can give rise to serious privacy concerns if not handled with care, as shared data may contain information that users do not want to disclose. Therefore, it is necessary to preserve an individual's privacy when the data is published to provide utility. Typically, guaranteeing individual privacy requires considerably more than concealing critical and sensitive attributes (e.g., name, salary, address) of data records through anonymization, because such anonymization mechanisms may still violate privacy due to the dependencies among critical attributes.
Different privacy-preserving frameworks have been introduced to address the privacy issue. Among them, differential privacy (DP) is a widely accepted privacy model [1]–[4]. As a precise mathematical standard for protecting the privacy of query results executed on statistical databases while giving verifiable privacy assurance to people, DP guarantees that, even if an adversary knows all records in a dataset except for one data record, it is impossible to retrieve the information of the targeted unknown record. DP's rigorous framework asserts that insertion or deletion of a single record marginally affects the probability distribution of query results. For example, two neighboring datasets differing by one record produce nearly identical query result distributions. Thus, after getting query answers, the adversary is unable to infer sensitive information of the targeted record in the database.
Although a commonly known framework used for privacy preservation, DP assumes that all records in a dataset are independently drawn from a representative population. Actually, there is strong correlation between records (known as dependent tuples) in most real-world datasets due to the existence of social, behavioral, and genetic interaction among them. Thus, some records show strong correlation properties that may release privacy information beyond expectation. For example, in a social network graph, users constantly interact with each other, and deletion of one node from the graph can be inferred from the rest of the graph [5]. The private features of a targeted user can be determined by manipulating the public features of the users having the same interests [6]. Another example presumes that, if a person in a family gets infected by a viral disease, other persons in the family are highly likely to suffer from the same disease. Thus, it enables an adversary to determine a specific person's health by having information about any of the medical records of the person’s family member.
Kifer and Machanavajjhala [7] are the pioneers who discovered that DP may not be efficient in preserving privacy in the case of dependent tuples. The leading cause is that DP hides the presence or absence of the records in the dataset but not the statistical information inferred from the dependent tuples of the dataset. This characteristic difference degrades the level of privacy expected from DP. Later, the same authors introduced Pufferfish, another privacy framework that considers the dependencies among the correlated tuples [8]. However, it does not suggest any perturbation technique to meet the prerequisite of DP. Afterward, researchers introduced a perturbation technique to cope with this issue and used Laplace noise to perturb the correlated records. Nevertheless, this technique injects a lot of noise during perturbation, which may modify the query results to the extent that it degrades the dataset's utility. Thus, DP is unable to preserve the privacy of dependent tuples while providing maximum utility in correlated datasets. As such, the proposed mechanism handles the dependency constraint between the tuples by adding Laplace noise to the query responses according to the degree of correlation (dependence) between them while providing rigorous privacy guarantees. The proposed mechanism computes the dependence between tuples using the maximal information coefficient (MIC) in a fine-grained manner. MIC is a standard technique for discovering nonlinear functional dependencies between the variables of large datasets. Moreover, MIC is easy to interpret compared to other correlation coefficients. Therefore, in this work, the calculated dependency coefficient is interpreted as the ratio of dependent indistinguishability of a tuple, which is the maximum change in a query output due to the modification of a tuple. Furthermore, we generate an inference attack on differentially private query results by utilizing the dependencies between the tuples of real-world datasets. The experimental results demonstrate the capability of the proposed model in ensuring the privacy guarantee of dependent tuples while providing maximum utility. In summary, our paper makes the following contributions:
1. This work quantifies the dependencies between the tuples in correlated datasets using the MIC to address the private information leak due to dependent tuples.
2. We enhance the utility of the dependent tuples by adding reduced noise to the query responses according to the degree of correlation between them while providing rigorous privacy guarantees.
3. We demonstrate the effectiveness of the proposed mechanism by generating an inference attack on differentially private query results of real-world datasets containing dependent tuples.
The rest of this paper is organized as follows: Section 2 reviews related literature, Section 3 presents the preliminaries used to formalize DP, and Section 4 and Section 5 describe the system overview and system model, respectively. The experimental evaluation is presented in Section 6. Finally, Section 7 concludes the paper.


Related Work

This section reviews the literature related to the role of DP in preserving the privacy of the datasets containing dependent tuples.
The research community has suggested various techniques and models to preserve the privacy of individuals in datasets containing sensitive information, e.g., healthcare data privacy [9, 10]. These mechanisms include DP-based privacy [2, 11, 12], homomorphic encryption-based privacy [13], privacy-enhancing technologies [14–16], etc. Among these, DP is the most effective framework for ensuring the privacy of individuals [17]. It has received much attention from researchers in the last few years. The remarkable success of DP rests on the assumption that datasets usually contain independent tuples [18]. However, this is not the case in most real-world datasets [19]. Kifer and Machanavajjhala [7] fundamentally investigated the privacy guarantees offered by DP for dependent tuples. They utilized the no-free-lunch theorem, characterizing non-privacy as a game, to argue that privacy can hardly be expected without considering the dependencies among the tuples of a dataset. They noted that DP might not give fruitful results in such situations.
Later, Kifer and Machanavajjhala [20] developed the Pufferfish framework that generalized DP for correlated datasets. They examined the impact of dependent tuples on the performance of DP more comprehensively than previously described and found that the background information gained by the adversary and the information extracted from the query results differ significantly, thus making DP ineffective in the case of correlated datasets. Furthermore, He et al. [21] formalized Blowfish, a subclass of the Pufferfish model that extended the implementation of DP by formalizing policy constraints for databases. They divided the information into secrets and constraints that may be known to the adversary. The utility-privacy tradeoff was managed by injecting less noise into secrets to maximize the utility. In [22], another DP-based mechanism was derived from the Pufferfish and Blowfish techniques to address the data dependency issue. The authors argued that accessibility compromises privacy while scrutinizing the deficiencies of the baseline approach. Another model was designed by combining the Pufferfish framework with Bayesian theory [23]. It addressed the privacy issues in high-dimensional correlated datasets, and Bayesian theory was used to examine the protection provided by DP in the case of dependent tuples.
Jiang et al. [24] comprehensively investigated the privacy assurance of social network datasets in correlated scenarios and proved that DP and its major variants could guarantee provable privacy. Chen et al. [25] mitigated the dependent tuples problem by multiplying the number of correlated tuples by the computed global sensitivity of the dataset. However, they introduced much noise to deal with correlation problems, which spoils the query results and thus reduces their utility. In [26], the concept of correlated sensitivity was introduced to be used in place of global sensitivity to minimize the amount of noise and enhance utility. Zhao et al. [27] formalized the concept of dependent differential privacy (DDP) to avoid the leak of users' sensitive information that may occur in correlated datasets. This DDP guarantee can be applied to any information relationship and is free of background knowledge. Wang and Wang [28] presented the notion of correlated tuple differential privacy (CTDP) and achieved its privacy guarantees by the generalized Laplace mechanism (GLM). The aforesaid techniques overcome the impediments of DP under dependent tuples. These mechanisms were computationally efficient, accomplishing higher utilities than earlier work. Lv and Zhu [29] tackled the correlation problem in big data by introducing the r-correlated block differential privacy (r-CBDP) mechanism. Their mechanism uses machine learning and MIC to verify the dependencies between the dependent tuples to preserve their privacy efficiently. Drakonakis et al. [30] designed LPAuditor, a tool for performing privacy loss evaluation on a public location dataset. They preserved the privacy of a dataset by adding noise to its histogram under the privacy budget. A correlation reduction scheme was introduced to deal with the privacy loss issue in machine learning algorithms trained on correlated datasets [31]. The authors employed a differentially private feature selection method to alleviate the privacy leak due to correlated tuples. Chen et al. [32] developed a differentially private quasi-identifier classification (DP-QIC) technique for big data publication. The presented approach conceals the correlated attributes after evaluating the dataset vulnerability by employing quasi-identifier classification based on the privacy ratio of attributes. In [19], the authors used Gibbs dependence to measure the correlation among the dataset records. However, the authors theoretically proved the result and did not implement the developed method on a real-world dataset. Most of the aforesaid techniques handle the privacy issue in correlated datasets by using either global sensitivity, correlated sensitivity, or a correlation coefficient. Despite these efforts, such amendments either inject a lot of noise that degrades utility or provide an inefficient privacy guarantee for the query result. Consequently, there is a need for effectively measuring the sensitivity to tackle its impact on correlated datasets. Therefore, this work uses MIC to discover accurately the dependencies among tuples of the dataset so that the correlated sensitivity can be precisely calculated. This paper focuses on providing maximum utility while preserving privacy by adding less noise according to the degree of dependencies of the correlated datasets.


Preliminaries

In this section, we explain the terms used in the formulation of the proposed model based on the concept of DP. Table 1 lists the notations used in the proposed formulation.
Table 1. Notations used in the formulation of the enhanced DP mechanism for dependent tuples

Notation Explanation
DS Dataset; training sample set
DS1, DS2 Neighboring datasets
M Differential privacy mechanism
S Set of outputs
ε Privacy budget
δ Privacy parameter
GS Global sensitivity
F Query function
A Laplace mechanism
Ϻ Maximal information coefficient mechanism
ᶄ Dependency coefficient
N Total number of records
L Number of correlated records
K Number of clusters


Differential Privacy
DP provides a rigorous framework to preserve an individual's information privacy. It publishes the query response over a dataset while preserving an individual's privacy, guaranteeing that the effect of changing a single tuple in the dataset is indistinguishable in the query response [17]. Thus, an adversary cannot obtain extra information about an individual record from the aggregated query results. It is formally defined as:

Definition 1 (Differential Privacy [3]). Suppose two datasets DS1 and DS2 are adjacent, with a difference of only a single tuple between them. Then, for every subset of outputs S, a randomized algorithm M provides ε-DP if M satisfies the following:

$\Pr[M(DS_1) \in S] \leq e^{\varepsilon} \cdot \Pr[M(DS_2) \in S]$  (1)

where ε denotes the amount of privacy budget required to achieve the desired privacy, and a small ε value indicates a higher level of privacy. DP achieves its goal of privacy preservation by adding calibrated noise to perturb the query results. The amount of noise added for perturbation depends on the function's sensitivity. The maximum change in the query output obtained by changing a single record in the dataset is the global sensitivity, which is defined as follows:

Definition 2 (Global Sensitivity [3]). Consider any query function f: DS→R executed on a dataset DS. The global sensitivity GS denotes the maximal difference between the outputs of f on neighboring datasets DS1 and DS2 that differ in a single record. Formally,

$GS = \max_{DS_1, DS_2} \lVert f(DS_1) - f(DS_2) \rVert_1$  (2)

where ||f(DS1)–f(DS2)||1 denotes the L1 norm of the function. Global sensitivity determines the noise magnitude required to fulfill the requirements of ε-DP. The noise required for perturbation is drawn using the Laplace mechanism to ensure privacy. The Laplace noise is added to the output of query results according to the global sensitivity of the query function.

$A(DS) = f(DS) + \mathrm{Lap}\!\left(\dfrac{GS}{\varepsilon}\right)$  (3)

It achieves ε-DP, where Lap(GS/ε) denotes Laplace noise drawn from the Laplace distribution with scale GS/ε and center 0.
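
To make Equations (1)–(3) concrete, the following is a minimal Python sketch of the Laplace mechanism applied to a counting query; the data, ε value, and variable names are illustrative and not taken from the paper.

import numpy as np

def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng=None):
    # Equation (3): perturb the true answer with Laplace noise of scale GS / epsilon.
    if rng is None:
        rng = np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=global_sensitivity / epsilon)

# Hypothetical counting query: adding or removing one record changes the count
# by at most 1, so the global sensitivity of the query is 1.
ages = np.array([23, 35, 41, 29, 52, 47])
true_count = int(np.sum(ages > 30))
noisy_count = laplace_mechanism(true_count, global_sensitivity=1.0, epsilon=0.5)
print(true_count, noisy_count)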

Maximal Information Coefficient
Reshef et al. [33] introduced the MIC algorithm to calculate the correlation between two random variables. MIC is universally applicable; it measures linear as well as nonlinear correlation between the tuples in a dataset. Thus, it is commonly used in big data correlation analysis. MIC uses binning so that mutual information can be computed on continuous random variables over all candidate grids once the bins are selected. The probability distribution of the scatter produced by these random variables determines the maximum mutual information over the different grids. This mutual information is normalized to obtain the feature matrix, and the maximal information coefficient is defined as the maximum value of the feature matrix [34, 35].

Definition 4 (Maximal Information Coefficient). Consider two random variables A and B in a dataset DS forming n scatter points; then MIC is defined as follows:

$\mathrm{MIC}(A;B) = \max_{i \cdot j \,<\, n^{0.6}} \dfrac{I(A;B)}{\log_2 \min(i, j)}$  (4)

where $i \cdot j$ denotes the grid size, with i and j representing the grid's column and row divisions, respectively. The MIC is regarded as a 21st-century correlation technique that measures linear and nonlinear relationships between two variables. MIC explores and classifies the complicated dependencies between pairs of variables in large data sets. It captures a wide range of interesting relationships and provides an index that systematically discovers all types of correlations between different pairs of variables. Therefore, this work uses MIC to measure the degree of correlation between the dependent tuples of OSN datasets.
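
For reference, MIC can be computed with an off-the-shelf implementation. The sketch below assumes the third-party minepy package (an implementation of Reshef et al.'s estimator) and synthetic data; it is not part of the authors' pipeline.

import numpy as np
from minepy import MINE  # third-party implementation of Reshef et al.'s MIC estimator

# Synthetic example: y depends nonlinearly on x, so MIC should be close to 1
# even though the (linear) Pearson correlation would be near 0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2 + rng.normal(0, 0.05, 1000)

mine = MINE(alpha=0.6, c=15)  # default grid-size parameters from the MIC paper
mine.compute_score(x, y)
print("MIC(x, y) =", mine.mic())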

Inference Attack
The inference issue was first discussed in statistical datasets, where accumulated data of a group revealed information about a specific individual. These inference problems then moved to social databases in the 1980s. Since then, inference attacks have targeted different types of information and databases, such as multi-level database frameworks, object-oriented databases, outsourced encrypted datasets, XML databases, semi-structured databases, media databases, social networks, geo-located information, etc. [36].
The inference attack [37] aims to discover whether a data record is present in a dataset by exploiting vulnerabilities in different authorized query outputs. These vulnerabilities are present due to the unavoidable interconnections between sensitive and public data. Thus, they can give rise to serious privacy concerns for individuals. These concerns are undeniable in online businesses, e-government, blockchain-based health supply chains [10], etc. These databases contain quite sensitive and critical information. For example, the presence of a specific patient's medical record in a particular disease dataset reveals that the patient suffers from such a disease. In this paper in particular, we generate an inference attack to check the ability of the proposed enhanced DP framework to ensure the privacy of the targeted dataset.


System Overview

In this section, we propose an efficient mechanism to preserve the privacy of datasets containing dependent tuples by applying enhanced DP to the query results of these datasets. Furthermore, we evaluate the efficiency of our mechanism by launching an inference attack on these datasets.

Fig. 1. Systematic flow of the proposed mechanism.


We use publicly available datasets for our research, i.e., Facebook, Twitter, and MySpace. As a first step, the datasets are cleaned by using the regex framework. Afterward, all the datasets are hashed using the SHA-256 algorithm to put all datasets in a single format. After cleaning and hashing, we add Laplace noise to query results proportional to the degree of dependence among the tuples. Finally, we generate an inference attack to determine the privacy guarantee of our proposed framework. Fig. 1 elaborates on the working of the complete system.

Dataset Description
It is often challenging to recognize dependent tuples and their associated degree of correlation in a specific dataset. At the same time, real-world datasets show strong correlation between tuples, and their release may give rise to serious privacy challenges. Therefore, we extract freely accessible users' social datasets available online to investigate information leak due to dependent tuples. These datasets contain different fields of users' real-time data from the last few years. We have used the datasets of Facebook, Twitter, and MySpace. These networks provide a platform for people to interact with each other, resulting in interdependencies among individuals. These datasets cover a long period of time and a wider range of users' attributes. We put these datasets into the same format after cleaning.
The following is a brief description of the datasets used in our experiment.


4.1.1 Facebook dataset
The Facebook dataset has an extraordinary case of correlation due to immense interactions among individuals, i.e., users may have the same friends, every user may like or dislike other users' posts, etc. Therefore, there are strong dependencies between tuples of said dataset. We use a freely available Facebook dataset (https://www.kaggle.com/datasets/sheenabatra/facebook-data) to capture the dependencies among its tuples. This dataset contains 99903 tuples with 15 attributes.


4.1.2 Twitter dataset
The Twitter dataset also contains strong dependencies between the tuples in the form of followers, followees, tweets containing mentions and retweets, etc. Sanders Analytics created the Twitter sentiment dataset (https://github.com/zfz/twitter_corpus), which consists of 5,513 classified tweets (however, 400 tweets are missing because of the organization's contents). Each tweet is classified with regard to one of four distinct topics. The tweets in this dataset were sent between February 2012 and May 2018, with most of them posted from 2015 through 2017.


4.1.3 MySpace dataset
We also experimented with the dataset from MySpace (https://www.chatcoder.com/data.html), another social networking platform that offers open access to user profiles. The user profile consists of public attributes such as age, sex, location, photo, number of friends, etc., and each profile is connected to the other's profile through a friend relationship. Thus, there is correlation among the tuples of the dataset. An adversary can exploit the users' profile data with background knowledge to infer sensitive information about the user and his friends. This dataset consists of 9,000 users' profiles.

Regex Implementation and Format Specification of Datasets
We used the regex (regular expression) framework for cleaning the datasets. The tuples of these datasets contain different types of attributes with missing values and random data points. Thus, we clean the datasets and put them in the same format to settle these problems and avoid unnecessary records. We used the regex framework available in Python's "re" package, which translates regular expressions into an internal representation that can be executed against the text. A regular expression consists of a sequence of characters describing a pattern, used here to validate the dataset's records according to the required criteria. The regex framework also provides commands to replace elements such as null values within strings of text. Regex makes cleaning and working with text-based datasets much more manageable and valuable for the analysis.
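
As an illustration, a cleaning pass of this kind might look like the sketch below; the specific patterns, column names, and DataFrame usage are assumptions, since the paper does not list its exact regex rules.

import re
import pandas as pd

def clean_text(value):
    # Normalize one raw text field: drop URLs, user mentions, and stray symbols.
    if value is None or (isinstance(value, float) and pd.isna(value)):
        return ""                                   # treat missing values as empty text
    value = re.sub(r"https?://\S+", "", value)      # remove URLs
    value = re.sub(r"@\w+", "", value)              # remove user mentions
    value = re.sub(r"[^A-Za-z0-9\s]", " ", value)   # drop remaining punctuation/symbols
    return re.sub(r"\s+", " ", value).strip()       # collapse repeated whitespace

# Hypothetical usage on a tweet-like text column.
df = pd.DataFrame({"text": ["Check this out! https://t.co/xyz @user :)", None]})
df["text_clean"] = df["text"].map(clean_text)
print(df)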

Hashing of Datasets
We used the SHA-256 algorithm, through Python's built-in hashing library, to hash the data and put the datasets in the same format. Due to the large size of the datasets, we split them into chunks that can easily be read into memory. SHA (Secure Hash Algorithm) is a set of cryptographic hash functions used for various security applications. Here, we used hashing to reduce the size of the dataset by means of the hash table created as a result of the hashing process. Hash functions transform arbitrarily large bit strings (messages) into small, fixed-length bit strings (message digests). These digests can identify the messages from which they are produced with very high probability. Python supports several SHA-2 variants in the "hashlib" library. The "hashlib.algorithms_guaranteed" attribute lists the hash algorithms available on every platform. SHA-256 operates on 32-bit words and uses a sequence of sixty-four 32-bit constants K_0^{256}, K_1^{256}, K_2^{256}, ..., K_63^{256}. The code snippet for the hexadecimal output of the hashing is shown in Fig. 2.

Fig. 2. SHA-256 implementation.
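
Since the snippet in Fig. 2 is not reproduced here, the following is a minimal sketch of how chunked SHA-256 hashing with hexadecimal output could be done with Python's hashlib and pandas; the file path, column name, and chunk size are placeholders.

import hashlib
import pandas as pd

def sha256_hex(value):
    # Hash a single field value and return its hexadecimal digest.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def hash_column_in_chunks(csv_path, column, chunksize=100_000):
    # Read a large dataset in memory-friendly chunks and hash one column of each chunk.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        yield chunk[column].map(sha256_hex)

# hashlib.algorithms_guaranteed lists the hash functions available on every platform.
print(sorted(hashlib.algorithms_guaranteed))
print(sha256_hex("example_user_id"))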


Correlation Metric Calculation
According to Kifer and Machanavajjhala [7], an adversary can combine the query response with background knowledge to obtain sensitive information about an individual. This becomes more evident when there are dependent tuples in a dataset, because DP cannot protect the privacy of datasets in the presence of dependent tuples. In the real world, database records are not entirely independent: different tuples depend on one another, and changing one record can affect many other records in the dataset. Thus, a robust mechanism is required to preserve the privacy of datasets containing dependent tuples while maintaining acceptable accuracy of the query results.
Motivated by the aforesaid problem, we have proposed a mechanism that preserves the privacy of query responses by adding Laplace noise to them in accordance with the degree of correlation among the tuples of the dataset. Simply adding the Laplace noise can lead to noisy query results that ultimately degrade their performance and utility. Therefore, the critical component of our mechanism is to quantify the degree of correlation among tuples. Our mechanism employs MIC to measure the degree of correlation among the tuples of the database. Next, we use this correlation coefficient to measure precisely the sensitivity of the query function. Afterward, the query responses are generated by adding a minimum amount of noise in proportion to the calculated degree of correlation among the tuples to enhance the utility of the query responses in terms of accuracy while efficiently preserving their privacy.
For a given dataset DS with n records in total, let l be the number of correlated records (l < n). MIC is then used to determine whether different records of the dataset are correlated:

$ᶄ = \mathrm{MIC}(x_i; y_j)$  (5)

where $x_i$ and $y_j$ are records of DS between which the correlation metric is calculated.
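
One plausible way to organize this computation is to evaluate the pairwise MIC of Equation (5) over all record pairs and derive a per-record dependency coefficient from it. The sketch below assumes the minepy implementation and a hypothetical correlation threshold; it illustrates the idea only and is not the authors' exact procedure.

import numpy as np
from minepy import MINE

def pairwise_mic(matrix):
    # matrix: one row per record, one column per numeric attribute.
    n = matrix.shape[0]
    mic = np.zeros((n, n))
    mine = MINE(alpha=0.6, c=15)
    for i in range(n):
        for j in range(i + 1, n):
            mine.compute_score(matrix[i], matrix[j])
            mic[i, j] = mic[j, i] = mine.mic()   # Equation (5): MIC between records i and j
    return mic

def dependency_coefficients(mic, threshold=0.2):
    # Hypothetical rule: a record counts as correlated when some pairwise MIC exceeds
    # the threshold, and its strongest dependency is kept as its coefficient.
    correlated = mic > threshold
    return np.where(correlated.any(axis=1), mic.max(axis=1), 0.0)

# Toy data: 6 records described by 50 attributes; record 1 is made to depend on record 0.
rng = np.random.default_rng(0)
records = rng.normal(size=(6, 50))
records[1] = 0.9 * records[0] + rng.normal(0, 0.1, 50)
print(dependency_coefficients(pairwise_mic(records)))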

Clustering of Dataset
We used a clustering approach to divide the dataset into clusters for the Laplace mechanism implementation. When a big dataset is used as a single set, applying DP to preserve the privacy of a single record may consume a large amount of privacy budget and computing resources. The division of the dataset into clusters helps find the correlated sensitivity of the query function for each cluster according to the computed MIC. We employ the k-means clustering algorithm for dataset division, with the number of clusters chosen by the elbow method, an exploratory method used to determine the number of clusters in a dataset [38]. This method plots the explained variation as a function of the number of clusters and picks the elbow of the curve as the number of clusters to use. We use the elbow method to compute the number of clusters k for the dataset division and obtain DS = {DS1, DS2, ..., DSk}, where each cluster is independent.
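
A sketch of this step, assuming scikit-learn's KMeans and a simple curvature-based reading of the elbow, is given below; the data and the value of k_max are placeholders.

import numpy as np
from sklearn.cluster import KMeans

def elbow_k(data, k_max=10):
    # Fit k-means for k = 1..k_max and record the within-cluster sum of squares (inertia).
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
                for k in range(1, k_max + 1)]
    # Simple curvature heuristic: the elbow is where the drop in inertia slows the most.
    second_diff = np.diff(np.diff(inertias))
    return int(np.argmax(second_diff)) + 2, inertias

# Placeholder feature matrix standing in for a cleaned, hashed dataset.
data = np.random.default_rng(1).normal(size=(500, 4))
k, _ = elbow_k(data)
cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
print("chosen k =", k)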

Laplace Implementations for Differential Privacy
Several mechanisms are available in the literature to perturb query results, e.g., the Laplace mechanism, the Gaussian mechanism, and the Exponential mechanism. The Laplace and Exponential mechanisms provide pure ε-DP with zero tolerance or relaxation. In contrast, the Gaussian mechanism provides (ε, δ)-DP with some relaxation in the form of δ. The Laplace mechanism is the most suitable among these mechanisms because the Gaussian mechanism does not provide pure DP, whereas the Exponential mechanism is too generic: it typically gives theoretical lower bounds, while its practical implementations demonstrate the same performance as other approaches (e.g., noisy max). Therefore, we employ the Laplace mechanism to draw noise according to the computed sensitivity to achieve our enhanced DP mechanism. Mathematically, this scenario is stated as follows:

$\Pr[A(DS_1) = s] \leq e^{\varepsilon} \cdot \Pr[A(DS_2) = s]$  (6)

DS is considered a statistical database with dependencies between tuples A and B, and s is the query result. The Laplace mechanism is applied to the query results according to the degree of dependence between the tuples, with MIC providing the required dependency information. By calibrating the noise to this statistical bound, only the minimum amount of noise is injected into the results of a computational query, without degrading their utility. Thus, the enhanced equation of DP is as follows:

$A(DS) = f(DS) + \mathrm{Lap}\!\left(\dfrac{ᶄ \cdot GS}{\varepsilon}\right)$  (7)

where ᶄ is the dependency coefficient that formalizes the equation and adds an appropriate amount of noise into query responses while avoiding leak of an individual's information in the case of dependent tuples.
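
Under the reading of Equation (7) given above, per-cluster query answers would be perturbed with Laplace noise scaled by the dependency coefficient ᶄ. The sketch below illustrates that idea with hypothetical counts and coefficients; it is not the authors' implementation.

import numpy as np

def enhanced_dp_answer(true_answer, global_sensitivity, epsilon, dep_coeff, rng=None):
    # Equation (7) as read above: the Laplace scale grows with the dependency
    # coefficient, so clusters with stronger correlation receive more noise.
    if rng is None:
        rng = np.random.default_rng()
    scale = dep_coeff * global_sensitivity / epsilon
    return true_answer + rng.laplace(0.0, scale)

# Hypothetical per-cluster counts and MIC-derived dependency coefficients
# (a coefficient of 1.0 means the cluster behaves as if its tuples were independent).
cluster_counts = [120.0, 75.0, 210.0]
dep_coeffs = [1.4, 1.0, 2.1]
noisy = [enhanced_dp_answer(c, 1.0, 0.5, k) for c, k in zip(cluster_counts, dep_coeffs)]
print(noisy)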


System Model

The attractiveness of DP stems from the critical point that it is independent of the adversary's computational power and the background information available. DP employs the Laplace perturbation mechanism to ensure the privacy of the users' sensitive data. It provides a strong privacy guarantee when there are no dependent tuples in the dataset; thus, the mechanism requires only a small value of ε when tuples are independent. However, a higher value of ε is needed to ensure privacy when dependent tuples exist in the dataset, which degrades its privacy, as shown in Fig. 3.

Fig. 3. Privacy degradation (due to the higher value of ε) for dependent tuples.


The proposed model with enhanced DP uses a smaller value of privacy budget in the form of (ᶄε) to ensure privacy. The proposed solution uses MIC to measure the degree of correlation among the dependent tuples. Next, we use the computed degree of correlation to calculate the dependency coefficient ᶄ that is finally used to find the appropriate amount of noise so as to ensure the privacy of correlated datasets while providing maximum utility. Thus, it validates the effectiveness of the proposed mechanism. Fig. 4 depicts the privacy model of the enhanced DP mechanism.

Fig. 4. Privacy guarantee (due to the smaller value of ε) for dependent tuples.



Experimental Evaluation

We evaluate the performance of the enhanced DP mechanism on real-world datasets, i.e., Facebook, Twitter, and MySpace. The performance is measured in terms of (1) accuracy maintained while preserving privacy compared to the standard DP approach and (2) amount of information leaked by generating an inference attack.

Accuracy and Privacy Guarantees
The utility of the enhanced DP mechanism is evaluated by measuring the clustering accuracy of differentially private k-means centroids (u = [u1, u2, ..., uk]) of the aforesaid datasets. These centroids are used to share the perturbed query results of machine learning applications trained on the Facebook, Twitter, and MySpace datasets. We check the effect of the enhanced DP mechanism on these applications to evaluate their utility in terms of accuracy. The clustering accuracy of the enhanced DP mechanism is measured by randomly selecting 80% of the data for training and 20% for validation.
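
One simple way to operationalize this clustering-accuracy measurement is sketched below: held-out points are assigned to the nearest clean and noisy centroids, and accuracy is the fraction of matching assignments. The data, noise scale, and this particular accuracy definition are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 4))              # placeholder for a cleaned, hashed dataset
train, valid = train_test_split(data, test_size=0.2, random_state=0)   # 80% / 20% split

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)
centroids = km.cluster_centers_
noisy_centroids = centroids + rng.laplace(0.0, 1.0, size=centroids.shape)  # illustrative DP noise

def nearest(points, centers):
    # Index of the closest centroid for every point.
    return np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)

# Clustering accuracy: fraction of validation points whose assignment under the
# noisy centroids matches their assignment under the unperturbed centroids.
accuracy = np.mean(nearest(valid, centroids) == nearest(valid, noisy_centroids))
print("clustering accuracy:", accuracy)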

Fig. 5. Results of clustering accuracy for different datasets: (a) Facebook dataset, (b) Twitter dataset, and (c) MySpace dataset.


Fig. 5 demonstrates the clustering accuracy of Facebook, MySpace, and Twitter datasets. The plot of each dataset depicts that the enhanced DP mechanism achieves approximately the same accuracy results as those achieved by applying the standard DP. We evaluate the performance of the enhanced DP mechanism for different values of privacy budget ε on each dataset. The enhanced DP achieves up to 82%, 82.5%, and 81.5% accuracy for Facebook, Twitter, and MySpace datasets, respectively, at ε=1.0, providing a significant privacy guarantee. The proposed mechanism attains this level of accuracy because it incorporates a minimum amount of noise into the centroids according to the computed dependency coefficient (Equation 7). These results signify that the enhanced DP mechanism attains acceptable accuracy while providing a provable privacy guarantee.

Inference Attack Generation
This section presents the membership inference attack on the enhanced DP model. The attack helps check the capability of the proposed model in preserving the privacy of real-world datasets containing dependent tuples. In this attack, a sophisticated adversary is assumed to know the dependencies among tuples in the datasets. In the absence of a privacy-preserving mechanism, the adversary can extract the required sensitive information by submitting a large number of valid queries on the targeted dataset. Moreover, the adversary exploits the social relationships among users to make better inferences. Thus, a lot of information is leaked when the adversary incorporates the knowledge of correlated tuples in the attacking query. However, only a small amount of information can be extracted in the presence of a privacy-preserving mechanism like the standard DP. Therefore, to prove the robustness of the enhanced DP mechanism in achieving the required privacy of correlated datasets, an inference attack is generated on the Facebook, Twitter, and MySpace datasets. From Fig. 6, it can be observed that the enhanced DP mechanism resists the attack rigorously and leaks much less information than the standard DP.
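
As a toy illustration of such an attack (not the authors' attack code), the sketch below runs a membership inference against a noisy count query: knowledge that l records are correlated with the target widens the gap between the "present" and "absent" hypotheses and raises the attack's success rate.

import numpy as np

rng = np.random.default_rng(3)

def noisy_count(count, epsilon, sensitivity=1.0):
    # Standard-DP release of a counting query.
    return count + rng.laplace(0.0, sensitivity / epsilon)

def infer_membership(released, count_without, gap):
    # Decide "target present" if the released answer is closer to the count that
    # includes the target (and its correlated records) than to the count without it.
    return abs(released - (count_without + gap)) < abs(released - count_without)

epsilon, trials, l_correlated = 0.5, 5000, 4
hits = 0
for _ in range(trials):
    count_without = 100
    # With dependent tuples, including the target also pulls in its l correlated
    # records, so the gap the adversary tests for grows from 1 to 1 + l.
    released = noisy_count(count_without + 1 + l_correlated, epsilon)
    hits += infer_membership(released, count_without, gap=1 + l_correlated)
print("attack success rate:", hits / trials)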

Fig. 6. Estimation of information leaked due to the standard DP and enhanced DP by applying an inference attack on different datasets: (a) Facebook dataset, (b) Twitter dataset, and (c) MySpace dataset.


Similarly, Fig. 6 demonstrates the difference between the information leaked with and without the dependence assumption in the case of standard DP. A well-informed adversary with greater background knowledge of the dependence among tuples executes a query and gets results that leak much more information than would be leaked without knowledge of the dependence among the tuples. The standard DP without the dependence assumption preserves privacy to some extent. However, the dependency information helps the adversary get more information from the datasets. In contrast, the enhanced DP mechanism performs much better in preserving the individual's privacy by leaking less information when an adversary has access to the dependency information. The privacy budget (ε) is the cost required to preserve privacy; a smaller value means better privacy. The standard DP and the proposed enhanced DP mechanism consume almost the same privacy budget, but the difference lies in the amount of information leaked. The proposed mechanism provides an acceptable tradeoff between privacy and utility, i.e., maintaining utility while leaking less information at a modest privacy cost. Therefore, the proposed mechanism is robust enough to tackle the dependent tuple problems in correlated datasets. In contrast, the standard DP does not consider the dependencies among the tuples of the datasets and thus leaks more information. Consequently, this proves the efficiency of the proposed mechanism in achieving verifiable privacy assurance for datasets containing dependent tuples.


Conclusion

Standard DP is a well-known privacy framework but is unable to deal with the dependent tuples of correlated datasets. We have developed a robust, generalized, and enhanced DP mechanism that ensures rigorous privacy for datasets containing dependent tuples with maximal utility comparable to the standard DP. Furthermore, an inference attack is applied that reveals the weakness of the existing DP-based mechanisms and validates the robustness of the proposed solution. Experiments on large-scale social datasets have demonstrated that the standard DP leaks much information under the dependence assumption. In contrast, the enhanced DP mechanism leaks less information with a modest privacy budget while maintaining the tradeoff between utility and privacy. Future work will explore more accurate solutions to measure precisely the dependence among tuples, as this is crucial for achieving privacy with acceptable accuracy. Furthermore, there is a need for computationally efficient solutions and algorithms to deal with high-dimensional correlated datasets.


Author’s Contributions

Conceptualization, GW, SA, SR. Funding acquisition, SA, GW. Investigation and methodology, SA, SR. Project administration, SA, GW. Resources, SA, GW. Supervision, SA, GW. Writing of the original draft, SA, SR. Writing of the review and editing, SA, SR. Software, SA, TR. Validation, SA, TR. Formal analysis, SA, SR. Data curation, SR, TR. Visualization, SR, TR.


Funding

This study was funded in part by the National Key Research and Development Program of China (No. 2020YFB1005804), the National Natural Science Foundation of China (No. 61632009), the High-Level Talents Program of Higher Education in Guangdong Province (No. 2016ZJ01), and the National Centre of Robotics and Automation, NUST, Pakistan (No. RF-017).


Competing Interests

The authors declare that they have no competing interests.


Author Biography

Author
Name : Saqib Ali
Affiliation : School of Computer Science, Guangzhou University, Guangzhou, China.
Biography : Dr. Saqib Ali holds a Ph.D. degree in Computer Science from Universiti Teknologi Malaysia. Currently, he is working as a postdoctoral research fellow in the School of Computer Science, Guangzhou University, Guangzhou, China. He is a part of the PingER project led by the SLAC National Accelerator Laboratory, Stanford, USA in collaboration with PERN, MYREN, CERNET, and Guangzhou University to monitor and analyze the performance of the Internet links using advanced machine learning techniques. Over the last year, his research is inclined towards big data analytics, mobile computing, and artificial intelligence.

Author
Name : Guojun Wang
Affiliation : School of Computer Science, Guangzhou University, Guangzhou, China.
Biography : Dr. Guojun Wang received a B.Sc. degree in Geo-physics, M.Sc. degree in Computer Science, and a Ph.D. degree in Computer Science, at Central South University, China in 1992, 1996, 2002, respectively. He is a Pearl River Scholarship Distinguished Professor of Higher Education in Guangdong Province, a Doctoral Supervisor of School of Computer Science and Cyber Engineering, Guangzhou University, China. Previously, he had been a Professor at Central South University, China; an Adjunct Professor at Temple University, USA; a Visiting Scholar at Florida Atlantic University, USA; a Visiting Researcher at the University of Aizu, Japan; and a Research Fellow at the Hong Kong Polytechnic University, HK. His research interests include artificial intelligence, big data, cloud computing, mobile computing, trustworthy/dependable computing, cyberspace security, recommendation systems, and mobile healthcare systems.

Author
Name : Shazia Riaz
Affiliation : University of Agriculture, Faisalabad, Pakistan.
Biography :Shazia Riaz is serving as an assistant professor in the Department of Computer Science, Government College Women University Faisalabad, Pakistan. She received her BS and MS degree in Computer Science in 2005 and 2011, respectively. Currently, she is working towards a Ph.D. degree in Computer Science from the Department of Computer Science, University of Agriculture, Faisalabad, Pakistan. Her current research interest includes data mining, differential privacy, and the application of deep learning in big data analytics.

Author
Name : Tayyaba Rafique
Affiliation : University of Agriculture, Faisalabad, Pakistan.
Biography :Tayyaba Rafique received her M.Sc. degree in 2018. Currently, she is doing research for the award of MS degrees in Computer Science from the Department of Computer Science, University of Agriculture, Faisalabad, Pakistan. Her current research interest includes network security, privacy, and data mining.



References

[1] S. P. Kasiviswanathan and A. Smith, “On the 'Semantics' of differential privacy: a Bayesian formulation,” Journal of Privacy and Confidentiality, vol. 6, no. 1, pp. 1-16, 2014.
[2] C. Dwork and J. Lei, “Differential privacy and robust statistics,” in Proceedings of the 41st Annual ACM Symposium on Theory of Computing, Bethesda, MD, 2009, pp. 371-380.
[3] C. Dwork, “Differential privacy,” in Automata, Languages and Programming. Berlin, Germany: Springer-Verlag, 2006, pp. 1-12. https://doi.org/10.1007/11787006_1
[4] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography. Berlin, Germany: Springer-Verlag, 2006, pp. 265-284.
[5] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” in Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, LA, 2003, pp. 556-559.
[6] A. Chaabane, G. Acs, and M. A. Kaafar, “You are what you like! Information leakage through users' interests,” in Proceedings of Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2012.
[7] D. Kifer and A. Machanavajjhala, “No free lunch in data privacy,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 2011, pp. 193-204.
[8] D. Kifer and A. Machanavajjhala, “Pufferfish: a framework for mathematical privacy definitions,” ACM Transactions on Database Systems, vol. 39, no. 1, article no. 3, 2014. https://doi.org/10.1145/2514689
[9] J. Song, Z. Han, W. Wang, J. Chen, and Y. Liu, “A new secure arrangement for privacy-preserving data collection,” Computer Standards & Interfaces, vol. 80, article no. 103582, 2022. https://doi.org/10.1016/j.csi.2021.103582
[10] A. E. Azzaoui, H. Chen, S. H. Kim, Y. Pan, and J. H. Park, “Blockchain-based distributed information hiding framework for data privacy preserving in medical supply chain systems,” Sensors, vol. 22, no. 4, article no. 1371, 2022. https://doi.org/10.3390/s22041371
[11] Z. Lian, W. Wang, and C. Su, “COFEL: communication-efficient and optimized federated learning with local differential privacy,” in Proceedings of IEEE International Conference on Communications, Montreal, Canada, 2021, pp. 1-6.
[12] X. Feng and C. Zhang, “Local differential privacy for unbalanced multivariate nominal attributes,” Human-centric Computing and Information Sciences, vol. 10, article no. 25, 2020. https://doi.org/10.1186/s13673-020-00233-x
[13] M. M. Salim, I. Kim, U. Doniyor, C. Lee, and J. H. Park, “Homomorphic encryption based privacy-preservation for IoMT,” Applied Sciences, vol. 11, no. 18, article no. 8757, 2021. https://doi.org/10.3390/app11188757
[14] S. C. Cha, T. Y. Hsu, Y. Xiang, and K. H. Yeh, “Privacy enhancing technologies in the internet of things: perspectives and challenges,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2159-2187, 2019.
[15] W. Wang, H. Xu, M. Alazab, T. R. Gadekallu, Z. Han, and C. Su, “Blockchain-based reliable and efficient certificateless signature for IIoT devices,” IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 7059-7067, 2022.
[16] B. D. Deebak, F. H. Memon, S. A. Khowaja, K. Dev, W. Wang, N. M. F. Qureshi, and C. Su, “Lightweight blockchain based remote mutual authentication for AI-empowered IoT sustainable computing systems,” IEEE Internet of Things Journal, 2022. https://doi.org/10.1109/JIOT.2022.3152546
[17] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211-407, 2013.
[18] C. Liu, S. Chakraborty, and P. Mittal, “Dependence makes you vulnerable: differential privacy under dependent tuples,” 2016 [Online]. Available: https://www.ndss-symposium.org/wp-content/uploads/2017/09/dependence-makes-you-vulnerable-differential-privacy-under-dependent-tuples.pdf.
[19] A. Kontorovich, M. Sadigurschi, and U. Stemmer, “Adaptive data analysis with correlated observations,” 2022 [Online]. Available: https://arxiv.org/abs/2201.08704.
[20] D. Kifer and A. Machanavajjhala, “A rigorous and customizable framework for privacy,” in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, Scottsdale, AZ, USA, 2012, pp. 77-88.
[21] X. He, A. Machanavajjhala, and B. Ding, “Blowfish privacy: tuning privacy-utility trade-offs using policies,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, 2014, pp. 1447-1458.
[22] H. Liu, Z. Wu, and L. Zhang, “A differential privacy incentive compatible mechanism and equilibrium analysis,” in Proceedings of 2016 International Conference on Networking and Network Applications (NaNA), Hakodate, Japan, 2016, pp. 260-266.
[23] B. Yang, I. Sato, and H. Nakagawa, “Bayesian differential privacy on correlated data,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 2015, pp. 747-762.
[24] H. Jiang, J. Pei, D. Yu, J. Yu, B. Gong, and X. Cheng, “Applications of differential privacy in social network analysis: a survey,” IEEE Transactions on Knowledge and Data Engineering, 2021. https://doi.org/10.1109/TKDE.2021.3073062
[25] R. Chen, B. C. M. Fung, P. S. Yu, and B. C. Desai, “Correlated network data publication via differential privacy,” The VLDB Journal, vol. 23, no. 4, pp. 653-676, 2014.
[26] T. Zhu, P. Xiong, G. Li, and W. Zhou, “Correlated differential privacy: hiding information in Non-IID data set,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 2, pp. 229-242, 2015.
[27] J. Zhao, J. Zhang, and H. V. Poor, “Dependent differential privacy for correlated data,” in Proceedings of 2017 IEEE Globecom Workshops (GC Wkshps), Singapore, 2017, pp. 1-7.
[28] H. Wang and H. Wang, “Correlated tuple data release via differential privacy,” Information Sciences, vol. 560, pp. 347-369, 2021.
[29] D. Lv and S. Zhu, “Achieving correlated differential privacy of big data publication,” Computers & Security, vol. 82, pp. 184-195, 2019.
[30] K. Drakonakis, P. Ilia, S. Ioannidis, and J. Polakis, “Please forget where I was last summer: the privacy risks of public location (meta)data,” 2019 [Online]. Available: https://arxiv.org/abs/1901.00897.
[31] T. Zhang, T. Zhu, P. Xiong, H. Huo, Z. Tari, and W. Zhou, “Correlated differential privacy: feature selection in machine learning,” IEEE Transactions on Industrial Informatics, vol. 16, no. 3, pp. 2115-2124, 2020.
[32] S. Chen, A. Fu, S. Yu, H. Ke, and M. Su, “DP-QIC: a differential privacy scheme based on quasi-identifier classification for big data publication,” Soft Computing, vol. 25, no. 11, pp. 7325-7339, 2021.
[33] D. N. Reshef, Y. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
[34] Y. Zhang, Q. Hu, W. Zhang, and J. Liu, “A novel Bayesian network structure learning algorithm based on maximal information coefficient,” in Proceedings of 2012 IEEE 5th International Conference on Advanced Computational Intelligence (ICACI), Nanjing, China, 2012, pp. 862-867.
[35] A. Mousavi and R. G. Baraniuk, “An information-theoretic measure of dependency among variables in large datasets,” in Proceedings of 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, 2015, pp. 650-657.
[36] R. Heatherly, M. Kantarcioglu, and B. Thuraisingham, “Preventing private information inference attacks on social networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1849-1862, 2013.
[37] Y. Yang, Y. Li, and R. H. Deng, “New paradigm of inference control with trusted computing,” in Data and Applications Security XXI. Berlin, Germany: Springer, 2007, pp. 243-258.
[38] A. C. Benabdellah, A. Benghabrit, and I. Bouhaddou, “A survey of clustering algorithms for an industrial context,” Procedia Computer Science, vol. 148, pp. 291-302, 2019.

About this article
Cite this article

Saqib Ali, Guojun Wang, Shazia Riaz, and Tayyaba Rafique, “Preserving the Privacy of Dependent Tuples Using Enhanced Differential Privacy,” Human-centric Computing and Information Sciences, vol. 12, article no. 43, 2022. https://doi.org/10.22967/HCIS.2022.12.043

  • Received: 3 September 2021
  • Accepted: 5 May 2022
  • Published: 30 September 2022