Targeted Advertising That Protects the Privacy of Social Networks Users
• Carlo Blundo1, Carmen De Maio2,*, Mimmo Parente1, and Luisa Siniscalchi3

Human-centric Computing and Information Sciences volume 11, Article number: 18 (2021)
https://doi.org/10.22967/HCIS.2021.11.018

Abstract

Keywords

Introduction

Personalized advertising services are the most efficient and hence the most effective form of advertising delivery. A closer match leads to the delivery of more useful and interesting advertising content to users. Hence, companies benefit from services that find a good match between the interests of a user and a set of given advertisements. Therefore, user-related information is a fundamental asset for the efficient and effective delivery of advertising. However, one major drawback of this closer match is that personal and sensitive data concerning users' preferences are traded in the name of personalized advertising. Such a scenario inevitably raises evident privacy issues for the users. In fact, personal or sensitive data (e.g., pertaining to a user's habits, needs, etc.) are mined by infrastructures that operate in the shadows, with virtually no oversight.
Starting from the present authors' previous works [57], this paper presents a privacy-preserving matching service capable of determining when and how to present an advertisement to a user of an online social network (e.g., a Twitter user) without disclosing their sensitive data.

Fig. 1. Overall approach, where tpc stands for the generic topic.

Experiments on the advertisement-user matching service have been conducted on tweets obtained using the Twitter API. The matching was measured using precision, recall, and F-measure, which returned good values. Here, it is worth stressing that, with respect to the privacy-less advertisement-user matching service described in Section 5.1, the privacy layer does not significantly increase the overall computational complexity, since the cryptographic tools used are very efficient.
Note that, in a preliminary version of this work [12], a basic version of the privacy layer was added on top of the matching protocol. In this paper, the level of security is improved by the addition of a stronger privacy layer. In particular, Section 5.2.2 discusses some additional data that an adversary can deduce using the solution in [12], while Section 5.2.3 presents a way to get rid of this leak. Moreover, Section 5.2.4 describes an alternative privacy-layer solution whereby a third actor (i.e., a semi-trusted third party) interacts with both the social network and the advertiser. This solution makes it necessary to trust a third entity, but it has the advantage of being more efficient than the previous one.
The rest of this paper is organized as follows: Section 2 discusses some other proposals that are related to this paper's protocols; Section 3 presents the cryptographic tools needed in the private protocol; Section 4 presents the semantic data analysis phase; Section 5 describes the private matching protocols; Section 6 presents some experimental results; and Section 7 presents the conclusion.

Related Work

Finally, [23] exposes in some detail the privacy risks facing users when their data are used in the clear to profile them and to show them advertisements. In [23], it is noticed that online advertisements usually use a system of real-time bidding (RTB), which enables advertisers to compete in real-time auctions to show their ads. The authors of [23] observe that this type of mechanism promotes an aggressive user-profiling strategy, and go on to provide some solutions to discourage this phenomenon; however, this is a different approach with respect to adding an appropriate level of privacy on top of the matching protocol.

Cryptographic Primitives

The cryptographic primitives at the basis of the private protocols are summarized in the following. First, the notion of a private set intersection cardinality (PSI-CA) protocol is recalled; PSI-CA is the building block of the private matching protocol. Then, a variant of the ElGamal cryptosystem, which will be used to improve the protocol's privacy, is recalled.

Private Set Intersection CArdinality (PSI-CA)
A PSI-CA protocol is a cryptographic protocol between two players: a Client, which has a set X = {$x_1$,...,$x_n$}, and a Server, which has a set Y = {$y_1$,...,$y_m$}. When the protocol terminates, the Server does not acquire any information except the size of the Client's set (i.e., |X|). Instead, the Client learns |X ∩ Y| and the size of the set Y.
The work of [25] formally defines the security of a PSI-CA protocol. In the sequel, the security requirements of a PSI-CA protocol are informally recalled. More details and applications can be found in [26-28]. The security requirements of a PSI-CA protocol are as follows:

Server Privacy: The only information that the Client learns is |Y| and the size of the intersection between X and Y.

Client Privacy: The Server only learns the size of X.

Unlinkability: The players (Client and Server) cannot determine whether two runs of the protocol are executed on the same inputs, unless this is inferred from the protocol's output.

In the remaining sections, (⟨δ, |Y|⟩, ⟨⊥, |X|⟩) ← PSI-CA(X, Y) is used to indicate that the Client and the Server, on input, respectively, the sets X and Y, execute a PSI-CA protocol and, at the end, the Client outputs δ = |X ∩ Y| and the size of the Server's set, while the Server has no output (i.e., ⊥) beyond the size of the Client's set X.
In this paper, semi-honest security is considered; in other words, the players are assumed to be honest-but-curious. In this security setting, neither player deviates from the protocol flow; however, each player may try to infer sensitive data from the execution of the protocol and from the protocol's output.
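To make the Client/Server flow concrete, the following is a minimal sketch of a DDH-style PSI-CA in the spirit of the protocols cited above; the fixed prime, the base, and the hash-into-group mapping are illustrative toy choices, not the instantiation actually used in the paper, and the code is not production cryptography.

```python
import hashlib
import math
import secrets

# Toy DDH-style PSI-CA sketch (illustrative parameters, NOT production crypto).
P = 2**255 - 19      # a large prime; exponents invertible mod P - 1 can be undone
G = 2

def h2g(item: str) -> int:
    """Hash an item into the group as G^H(item) mod P (toy random-oracle stand-in)."""
    e = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return pow(G, e, P)

def rand_exp() -> int:
    """Random exponent invertible modulo P - 1."""
    while True:
        r = secrets.randbelow(P - 1)
        if r > 1 and math.gcd(r, P - 1) == 1:
            return r

def psi_ca(client_set, server_set) -> int:
    """Client learns delta = |X ∩ Y| (and |Y|); the Server sees only blinded values."""
    r, k = rand_exp(), rand_exp()                 # Client / Server secret exponents
    # Client -> Server: blinded hashes of X (canonical order stands in for shuffling)
    blinded = sorted(pow(h2g(x), r, P) for x in client_set)
    # Server -> Client: each value raised to k, plus H(y)^k for its own set Y
    back = [pow(b, k, P) for b in blinded]
    server_side = {pow(h2g(y), k, P) for y in server_set}
    # Client removes its blinding: (H(x)^{r k})^{r^{-1} mod P-1} = H(x)^k
    r_inv = pow(r, -1, P - 1)
    unblinded = {pow(b, r_inv, P) for b in back}
    return len(unblinded & server_side)
```

Neither party sees the other's raw items: the Server only handles blinded group elements, and the Client only learns the size of the overlap.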

Exponential ElGamal Variant
The exponential ElGamal variant is a cryptosystem based on the ElGamal cryptosystem [29]. It does not support computationally efficient decryption, but it does permit the private key owner to check whether a ciphertext corresponds to an encryption of 0. In such a cryptosystem, messages are elements of the additive group of integers modulo a prime p (i.e., messages belong to $Ζ_p$ = {0, . . . , p − 1}). The exponential ElGamal variant cryptosystem (EGV) can be described by the three algorithms (Gen, Enc, Dec) given below.

Gen($1^λ$): Given the security parameter $λ$, the algorithm Gen randomly generates the pair of private/public keys ($sk, pk$). Such an algorithm performs the following steps:
1. Randomly generates a prime p of $λ$ bits.
2. Computes a generator g of the multiplicative group $Ζ_p^*$ of integers modulo p (i.e., $Ζ_p^*$ = {1,...,p−1}).
3. Randomly chooses κ ∈ {2,...,p−2}.
4. Sets sk = κ and pk = (p, g, y = $g^κ$ mod p).
5. Outputs (sk, pk).

Enc(pk, m): Given the public key pk = (p, g, y) and a message m ∈ $Ζ_p$, the algorithm Enc outputs an encryption c of m. Such an algorithm performs the following steps:
1. Randomly chooses r ∈ {2,...,p−2}.
2. Computes α = $g^r$ mod p and β = $y^r$ · $g^m$ mod p.
3. Outputs c = (α, β).

Dec(sk, c): Given the secret key sk = κ and a ciphertext c = (α, β), the algorithm Dec outputs the decryption m of c. Such an algorithm performs the following steps:
1. Computes μ = β · $(α^κ)^{-1}$ mod p.
2. Outputs the discrete logarithm modulo p of μ.

The security of the above cryptosystem relies on how the prime number p is chosen. Indeed, p has to be selected in such a way that it is hard (i.e., computationally unfeasible) to compute the discrete logarithm of any value in $Ζ_p^*$ (given a generator g of $Ζ_p^*$ and a value a ∈ $Ζ_p^*$, the discrete logarithm of a mod p is the integer e such that a = $g^e$ mod p). State-of-the-art generic algorithms for computing the discrete logarithm over $Ζ_p^*$ have O(√p) computational complexity. Hence, p's bitlength should be quite large; it is suggested to choose p having about 2,048 bits (i.e., the security parameter λ should be at least 2,048). The algorithms Gen and Enc can be computed efficiently, while Dec cannot. Indeed, since y = $g^κ$ mod p, the following relation holds

μ = β · $(α^κ)^{-1}$ = ($y^r$ · $g^m$) · $(g^{rκ})^{-1}$ = ($g^{κr}$ · $g^m$) · $g^{-rκ}$ = $g^m$ mod p.    (1)

Therefore, m can be computed in a reasonable amount of time (e.g., by trying all possible values) only if m's bitlength is small, say about 40 bits. In any case, the private key owner can immediately verify whether a ciphertext is an encryption of 0 as, in this case, μ would be equal to one. Indeed, for some random value r, Dec(sk, Enc(pk, 0)) = Dec(sk, ($g^r$, $y^r$ · $g^0$)) = $g^0$ = 1.
Note that the exponential ElGamal variant provides additive homomorphism. Given the encryptions of two messages, say $m_1$ and $m_2$, they can be multiplied component-wise to obtain an encryption of their sum. Indeed, for some random values $r_1$, $r_2$ ∈ {2,...,p−2} and a public key pk,

Enc(pk, $m_1$) · Enc(pk, $m_2$) = ($g^{r_1}$ · $g^{r_2}$, $y^{r_1}$ · $g^{m_1}$ · $y^{r_2}$ · $g^{m_2}$) = ($g^{r_1+r_2}$, $y^{r_1+r_2}$ · $g^{m_1+m_2}$) = Enc(pk, $m_1$ + $m_2$).    (2)

Given an encryption (α, β) = Enc(pk, m) of a message m under a public key pk and a constant value c ∈ $Ζ_p^*$, the component-wise exponentiation of Enc(pk, m) is denoted by $Enc(pk,m)^c$, that is, $Enc(pk,m)^c$ = ($α^c$, $β^c$). The relation $Enc(pk,m)^c$ = Enc(pk, c · m) holds as

$Enc(pk,m)^c$ = ($α^c$, $β^c$) = ($g^{rc}$, $y^{rc}$ · $g^{c·m}$) = Enc(pk, c · m).    (3)

Such homomorphic properties will be used in the proposed Private Matching Protocol with enhanced security.
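As a concrete illustration, the following is a toy Python sketch of the EGV algorithms together with the zero-test and the additive homomorphism. The fixed prime and generator stand in for Gen's random choices and are far too small to be secure; they are assumptions for the sake of a runnable example.

```python
import secrets

# Toy EGV sketch; parameters are illustrative, not the 2,048-bit setting suggested above.
P = 2**61 - 1        # a Mersenne prime standing in for Gen's random λ-bit prime
G = 2

def gen():
    sk = secrets.randbelow(P - 3) + 2            # κ ∈ {2,...,p-2}
    return sk, (P, G, pow(G, sk, P))

def enc(pk, m):
    p, g, y = pk
    r = secrets.randbelow(p - 3) + 2
    return pow(g, r, p), (pow(y, r, p) * pow(g, m % (p - 1), p)) % p

def mu(sk, pk, c):
    """μ = β · (α^κ)^{-1} mod p = g^m; full Dec would still need a discrete log."""
    p, _, _ = pk
    a, b = c
    return (b * pow(a, p - 1 - sk, p)) % p

def is_zero(sk, pk, c):
    """The key owner immediately recognizes an encryption of 0: μ = g^0 = 1."""
    return mu(sk, pk, c) == 1

def hom_add(pk, c1, c2):
    """Additive homomorphism: the component-wise product encrypts m1 + m2."""
    p = pk[0]
    return (c1[0] * c2[0]) % p, (c1[1] * c2[1]) % p
```

Since exponents of g live modulo p − 1, the value −m can be represented as p − 1 − m; multiplying Enc(pk, m) by Enc(pk, p − 1 − m) then yields an encryption of 0, which the zero-test confirms.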

Pseudorandom Permutation
Informally, pseudorandom functions are efficiently computable, deterministic functions whose outputs are indistinguishable from random sequences. A pseudorandom function family (PRF) is a collection of pseudorandom functions, each of which can be chosen using a key κ. More precisely, each function is indexed by a key κ; hence, randomly choosing κ implicitly chooses a pseudorandom function within the family. A pseudorandom function family can be described by a function F : {0,1}$^k$ × {0,1}$^n$ → {0,1}$^m$. For any κ ∈ {0,1}$^k$, the function F(κ, ·) defines the mapping x ⟼ F(κ, x), which is efficiently computable. Given the previous definition, it is possible to define a pseudorandom permutation family (PRP) as a collection of pseudorandom functions where m = n and each F(κ, ·) is a bijection.
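A standard way to obtain a PRP from a PRF family is the Luby-Rackoff (Feistel) construction. The sketch below uses a SHA-256-based PRF stand-in, which is an illustrative choice rather than the paper's instantiation.

```python
import hashlib

def prf(key: bytes, data: bytes) -> bytes:
    """SHA-256-based PRF stand-in (illustrative, not a proven PRF)."""
    return hashlib.sha256(key + data).digest()

def feistel(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Luby-Rackoff: a few Feistel rounds over a PRF yield a keyed permutation
    on even-length blocks; the key κ indexes the permutation in the family."""
    half = len(block) // 2
    L, R = block[:half], block[half:]
    for i in range(rounds):
        F = prf(key + bytes([i]), R)[:half]       # per-round derived key
        L, R = R, bytes(a ^ b for a, b in zip(L, F))
    return L + R

def feistel_inv(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Inverse permutation: run the rounds backwards."""
    half = len(block) // 2
    L, R = block[:half], block[half:]
    for i in reversed(range(rounds)):
        F = prf(key + bytes([i]), L)[:half]
        L, R = bytes(a ^ b for a, b in zip(R, F)), L
    return L + R
```

The existence of `feistel_inv` is exactly what makes the keyed map a permutation rather than a mere function.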

Semantic Data Analysis

Semantic data analysis extracts the underlying concepts and topics of the unstructured input text. More specifically, it performs a semantic enrichment of the text through a Wikify algorithm [30]. Briefly, Wikify is the practice of linking the most relevant keywords in a piece of text to the corresponding definitions in Wikipedia (the Wikification service was originally provided by http://wikipedia-miner.cms.waikato.ac.nz/, but is now only available as a private local installation on one's own hardware; the Wikification source code is available at https://github.com/dnmilne/wikipediaminer).
Wikify assigns a unique identity to the entities mentioned in the text. It extracts a set of ⟨tpc, rel⟩ pairs, each consisting of tpc, the Wikipedia page corresponding to a detected topic or concept, and the corresponding degree of relevance rel [30].
In the following, an example is reported by considering the tweet in Fig. 2.

Fig. 2. Tweet of Barack Obama on September 17, 2019.

$tweet_i$ = “Just 16, @GretaThunberg is already one of our planet’s greatest advocates. Recognizing that her generation will bear the brunt of climate change, she’s unafraid to push for real action. She embodies our vision at the @ObamaFoundation: A future shaped by young leaders like her.”

An example of Wikification output is the following set of pairs: ⟨Climate change, 0.81⟩, ⟨Advocate, 0.72⟩, ⟨Climate, 0.72⟩, ⟨Visual perception, 0.7⟩, . . . .
In addition to the Wikify service, the system can calculate the Semantic Relatedness sr between a pair of Wikipedia pages. Given two topics corresponding to Wikipedia pages, the Semantic Relatedness is a value in the range of [0,1] that describes the distance/closeness of the input pages. It relies on evaluating the ratio of incoming and outgoing links shared by the input Wikipedia pages.
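For illustration, a common link-based formulation of such a measure is the Milne-Witten relatedness over shared links; the exact formula used by the service is not spelled out in the text, so the sketch below is an assumption in which each Wikipedia page is represented by its set of link identifiers.

```python
import math

def semantic_relatedness(links_a, links_b, total_pages):
    """Milne-Witten-style link relatedness (illustrative formulation).
    Returns a value in [0, 1]; higher means the two pages are closer."""
    common = links_a & links_b
    if not common:
        return 0.0
    big = max(len(links_a), len(links_b))
    small = min(len(links_a), len(links_b))
    d = (math.log(big) - math.log(len(common))) / (math.log(total_pages) - math.log(small))
    return max(0.0, 1.0 - d)
```

Two pages sharing most of their links score near 1, while pages with no shared links score 0, matching the [0, 1] range described above.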

(Private) Matching

As introduced in [7], this research work, given an advertisement at a certain time, uncovers the users potentially interested in that ad. Note that the topics characterizing users and ads could contain sensitive information which should not be disclosed. Hence, given the advertising matching protocol discussed in Section 5.1, the main goal here is to introduce a privacy layer on top of it so as to ensure that both the advertisements' topics and those of the users stay private.

A matching procedure is defined by taking into account the rough sets representing the user and the advertisement. More formally, given a threshold ε,the users’ posts, the corresponding timestamp, the extracted topics T, and the Semantic Relatedness sr, a binary relation $R^ε$ on T is induced as follows:

a $R^ε$ b ⟺ sr(a, b) ≥ ε,  for a, b ∈ T.    (4)

Let U = U(u, t) be the set of Wikipedia concepts (i.e., topics) extracted from all tweets posted by u in time slot t, and let A be the set of concepts extracted from a given advertisement; then consider the restriction of $R^ε$ to U by setting:

$R_U^ε$ = $R^ε$ ∩ (U × U)    (5)

and letting $R_U^ε$(a) = {b ∈ U | a $R_U^ε$ b}, the orthopair O(A) is defined as O(A) = (L(A), E(A)), where

L(A) = {a ∈ U | $R_U^ε$(a) ⊆ A}    (6)

E(A) = {a ∈ U | $R_U^ε$(a) ∩ A = ∅}.    (7)

To identify the degree of interest of users in a specific advertisement during the timeline, the number of concepts of U and A that are in L(A) and E(A) is determined (see Procedure 1).
The triple r = ($r_1$, $r_2$, $r_3$) is defined as follows:

r = ($r_1$, $r_2$, $r_3$) = (|L(A)|/|U|, |E(A)|/|U|, |L(A)|/|A|).    (8)

From definition (8), it is easy to verify that

the topics of U ∩ A are at least $r_1$% of all the topics of U;

at least $r_2$% of the topics of U are different from the topics of A;

the topics of U ∩ A are at least $r_3$% of all the topics of A.

Moreover, setting

$r_4$ = 1 − ($r_1$ + $r_2$),    (9)

the value $r_4$ is referred to as the value of uncertainty of u at time t. Simple algebra shows that

$r_4$ = |B(A)| / |U|,    (10)

where B(A), defined as

B(A) = U \ (L(A) ∪ E(A)),    (11)

represents the $R_U^ε$-boundary region of A. The values $r_1$, $r_2$, $r_3$, and $r_4$ enable the collection of valuable information for understanding the potential level of a user's interest in a given advertisement. To establish a unique indicator for u, a weight w depending on $r_1$, $r_2$, $r_3$, and $r_4$ is evaluated as follows:

w = hm($r_1$, $r_3$) · (1 − $r_4$),    (12)

where hm($r_1$,$r_3$) is defined as

hm($r_1$, $r_3$) = 2 · ($r_1$ · $r_3$) / ($r_1$ + $r_3$)    (13)

(i.e., the harmonic mean of $r_1$ and $r_3$). In conclusion, at time t, the advertisement A is shown to the Twitter user u if $w ≥ λ$, for a threshold $λ$ ∈ [0, 1].
Procedure 1. orthopair(U, A, ε)
L(A) ← ∅, E(A) ← ∅
for all a in U do
    B ← ∅
    for all b in U do
        if sr(a, b) ≥ ε then
            B ← B ∪ {b}
        end if
    end for
    if B ⊆ A then
        L(A) ← L(A) ∪ B
    end if
    if B ∩ A = ∅ then
        E(A) ← E(A) ∪ B
    end if
end for
return (L(A), E(A))
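Procedure 1 and the resulting weight can be sketched in Python as follows. The relatedness function sr is passed in as a parameter, and the closed form w = 2 · $s_L$($s_L$ + $s_ℇ$) / ($s_U$($s_A$ + $s_U$)) is the rewriting of w stated later in the paper; the toy sr used below is a hypothetical stand-in for the Wikipedia-based measure.

```python
def orthopair(U, A, eps, sr):
    """Procedure 1: L(A) collects ε-neighborhoods contained in A,
    E(A) those disjoint from A."""
    L, E = set(), set()
    for a in U:
        B = {b for b in U if sr(a, b) >= eps}    # the neighborhood R_U^ε(a)
        if B <= A:
            L |= B
        if not (B & A):
            E |= B
    return L, E

def interest_weight(U, A, eps, sr):
    """Weight w = 2 · s_L(s_L + s_E) / (s_U(s_A + s_U)); w = 0 when s_L = 0."""
    L, E = orthopair(U, A, eps, sr)
    sL, sE, sU, sA = len(L), len(E), len(U), len(A)
    if sL == 0:
        return 0.0
    return 2 * sL * (sL + sE) / (sU * (sA + sU))
```

With a toy relatedness that deems each topic related only to itself, U = {1, 2, 3} and A = {1, 2} give L(A) = {1, 2}, E(A) = {3}, and w = 0.8.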

Note that U and A can contain sensitive information and thus should not be disclosed. For instance, since the topics in U are extracted via user profiling, they could contain valuable information such as user habits, preferences, and lifestyles. On the other hand, the topics in A have some business value because they represent a way of promoting a product. Hence, the goal of this section is to introduce a privacy layer on top of the advertising matching protocol presented in Section 5.1.
The aim is to ensure that the topics from both the advertisements (i.e., the topics in A) and the user's messages (i.e., the topics in U) stay private. Once U and A have been computed, the technique of Section 5.1 allows only one player to calculate the matching (either the Advertiser or, more realistically, the Online Social Platform). If only one player (e.g., the Advertiser) computes the matching (i.e., the value w given by Equation (8)), it is easy to see that the Advertiser learns the topics in U. To avoid this issue, a different path was taken: the value w will be computed interactively. The Advertiser, interacting with the Online Social Platform, will compute the value w.
Despite the interaction, the protocol will be able to hide the topics characterizing the advertisement from the Online Social Platform and, at the same time, the Advertiser will be unaware of the topics profiling a given user. In order to calculate w while simultaneously guaranteeing the privacy of the topics in A and U, the protocol will make use of a specific multi-party computation protocol [31, 32], namely, the PSI-CA protocol described in Section 3.1 above.

A different view of the matching protocol
As a first step, in order to add privacy to the advertising matching protocol discussed in Section 5.1, some of its computations will be rearranged. Then, some computations are transformed in such a way that the topics' privacy is preserved. Setting $s_L$ = |L(A)|, $s_ℇ$ = |ℇ(A)|, $s_A$ = |A|, and $s_U$ = |U|, and using Equations (5), (6), and (9), simple algebra shows that Equation (8) can be written as follows:

w = 2 ∙ $\frac{s_L(s_L+s_ℇ)}{s_U(s_A+s_U)}$.    (14)

Indeed,

w = hm($r_1$, $r_3$) · (1 − $r_4$) = $\frac{2(s_L/s_U)(s_L/s_A)}{s_L/s_U + s_L/s_A}$ ∙ $\frac{s_L+s_ℇ}{s_U}$ = $\frac{2s_L}{s_A+s_U}$ ∙ $\frac{s_L+s_ℇ}{s_U}$ = 2 ∙ $\frac{s_L(s_L+s_ℇ)}{s_U(s_A+s_U)}$.    (15)

Note that, if $s_L$ = 0, then $r_1$ = $r_3$ = 0 and, according to Equation (8), w has to be 0. This is correctly captured by the above rewriting of w. Therefore, to compute the value w, the only information required is the size of the sets L(A), ℇ(A), A, and U.
It is immediate to see that the execution of the Advertising Matching Protocol can be organized into three phases as follows:

Given U, compute $R_U^ε$ (u), for u ∈ U.

Given $R_U^ε$($u_1$),…,$R_U^ε$($u_{s_U}$) and A, compute $s_L$ and $s_ℇ$.

Given $s_U$, $s_A$, $s_L$, and $s_ℇ$, compute w = 2 ∙ $\frac{s_L(s_L+s_ℇ)}{s_U(s_A+s_U)}$.

The Online Social Platform can compute the values occurring in the first phase, as it knows the users' posts and habits from which the topics characterizing them can be extracted. The second and third phases can be executed by the Advertiser, which, upon interacting with the Online Social Platform, computes the values $s_U$, $s_A$, $s_L$, and $s_ℇ$, from which the parameter w can be derived, according to Equation (10).
Note that all the elements in the set L(A) are elements in U, as this fulfills Relation (3), which states that an element u ∈ U belongs to L(A) if, and only if, $R_U^ε$(u) ⊆ A.
In order to test whether $R_U^ε$(u) ⊆ A, one can equivalently check whether |$R_U^ε$(u) ∩ A| = |$R_U^ε$(u)|. Hence, for any u ∈ U, it follows that u ∈ L(A) iff |$R_U^ε$(u) ∩ A| = |$R_U^ε$(u)|. Therefore, to compute $s_L$ it is enough to count how many $R_U^ε$(u)'s, with u ∈ U, satisfy |$R_U^ε$(u) ∩ A| = |$R_U^ε$(u)|. The previous arguments can also be applied to compute $s_ℇ$. In this case, according to Relation (4), the condition to check is whether |$R_U^ε$(u) ∩ A| = 0.
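These counting observations can be sketched as follows, with each PSI-CA run abstracted by a plain set-intersection oracle (a stand-in for the cryptographic protocol; the function names are illustrative):

```python
def psi_ca_size(X, Y):
    """Stand-in oracle returning δ = |X ∩ Y|, as a PSI-CA run would."""
    return len(set(X) & set(Y))

def advertiser_weight(A, neighborhoods):
    """From δ_i = |A ∩ R_i| and |R_i| alone, count s_L and s_E and derive w."""
    sU, sA = len(neighborhoods), len(A)
    sL = sE = 0
    for R in neighborhoods:                      # one PSI-CA execution per R_i
        delta = psi_ca_size(A, R)
        if delta == len(R):                      # R_i ⊆ A
            sL += 1
        elif delta == 0:                         # R_i ∩ A = ∅
            sE += 1
    return 0.0 if sL == 0 else 2 * sL * (sL + sE) / (sU * (sA + sU))
```

Only intersection sizes enter the computation, which is exactly the information a PSI-CA execution reveals to the Advertiser.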
In the next section, the aforementioned observations are exploited in order to transform the advertising matching protocol presented in Section 5.1 into a private one by also resorting to multiple sequential runs of a PSI-CA protocol.

Private matching protocol
Fig. 3 describes the first version of the proposed private matching protocol. Such a protocol is divided into three phases. The first one is executed by the Online Social Platform, which, given U = {$t_1^u$,...,$t_{s_U}^u$} and ε, computes the sets $R_i$ = $R_U^ε$($t_i^u$), for i = 1,...,$s_U$. During the second phase, the Advertiser and the Online Social Platform, on input A and $R_i$ respectively, interact by running a PSI-CA protocol. The Advertiser, after each protocol execution, computes the size $δ_i$ of the intersection A ∩ $R_i$ and the size of the set $R_i$, while the Online Social Platform only computes the size of the set A. The third phase is executed by the Advertiser, which computes the value w according to Equation (10).

Fig. 3. Private matching protocol.
Note that, in order to compute the sizes of the sets comprising the orthopair O(A) = (L(A), ℇ(A)) (i.e., the values $s_L$ and $s_ℇ$), the Advertiser resorts to the trick described at the end of Section 5.2.1. The Online Social Platform, upon receiving the value w from the Advertiser at the end of Phase III, determines whether to show the advertisement represented by the topics in A to a user characterized by the topics in U.
A note on computing w: Here, the authors draw attention to the fact that, in the private matching protocol presented in Fig. 3, the Advertiser directly computes the value w by itself. Were it otherwise, some information could be leaked. Indeed, the Online Social Platform, in order to quantify the value w that would allow it to decide whether or not to show an advertisement to a user, would have to compute the values $δ_i$ = |A ∩ $R_i$|, for i = 1,…,$s_U$. The values $δ_i$ could be computed by reversing the roles in the computations executed within the PSI-CA protocol. In this way, the Advertiser would compute ⟨⊥, |$R_i$|⟩, while the Online Social Platform would compute ⟨$δ_i$, |A|⟩, for i = 1,...,$s_U$. This could have some privacy-leakage aftereffects. Indeed, assume that the sets $R_1$,...,$R_{s_U}$ have different sizes (i.e., |$R_i$| ≠ |$R_j$| when i ≠ j). Then, the Online Social Platform, upon computing $δ_i$ = |$R_i$ ∩ A|, could easily check whether $δ_i$ = |$R_i$|, for some i ∈ {1,...,$s_U$}, discovering that each topic in $R_i$ characterizes the advertisement A. In this way, the Online Social Platform would accumulate some limited findings on how A is composed, thereby potentially impairing the Advertiser's privacy.
A note on the attained privacy level: According to the steps executed in the proposed protocol, the topics in A and U (i.e., respectively, the topics of the Advertiser and of the Online Social Platform) are concealed. Nevertheless, although the protocol preserves the topics' privacy, the Advertiser learns, in addition to w, some extra information. It was mentioned above that, for i ∈ {1,...,$s_U$}, the Advertiser knows $δ_i$ = |$R_i$ ∩ A|. Since the Advertiser cannot determine which topics in A belong to $R_i$, this event is not considered a serious privacy breach. In any case, Section 5.2.3 will present a technique designed to withstand such a privacy leakage. The solution will be based on the use of fake topics in both $R_i$ and A. In this paper's improved protocol, because of the use of the EGV cryptosystem, the Advertiser will only verify either whether $R_i$ ∩ A does not contain any element (i.e., |$R_i$ ∩ A| = 0) or whether |$R_i$ ∩ A| = |$R_i$|, for i ∈ {1,...,$s_U$}. From such insights, the Advertiser cannot infer any sensitive information on the topics in $R_i$.
The value w computed by the Advertiser can suggest that an unidentified user might be attracted by the advertisement characterized by A. The parameter w describes how close A is to U. Consequently, the Advertiser could infer an estimate of the number of users engaged by the advertising campaign. Knowing such an estimate is not a negative feature of the protocol: the Advertiser could profitably use such information to assess the advertising campaign's effectiveness by estimating the number of target users likely to be reached. Moreover, to determine the profits to ascribe to the Online Social Platform, the Advertiser usually implements other methods (of which some examples can be found in [33-36]). Therefore, the information leaked by the proposed protocol can anyhow be readily learned by the Advertiser through different processes.

Increasing privacy of the private matching protocol
The previous section pointed out that the private matching protocol described in Fig. 3 suffers from some minor privacy leaks. Such a privacy problem can be avoided by resorting to the introduction of fake topics and by using the EGV cryptosystem.
In this paper's amended protocol, to mask the real topics characterizing the advertisement and the users, a set of fake topics Dummy is used, such that Dummy ∩ T = ∅.
As such fake topics do not have to represent any particular attribute, they can simply be random strings. Both the Advertiser and the Online Social Platform know the set Dummy, but such a set is used differently by the two players. More precisely, the Advertiser, instead of using the set A, will use the set A' = A∪Dummy; while the Online Social Platform will interact with the Advertiser using the set $R_i$' = $R_i∪D_i$, for a randomly chosen subset $D_i$ $⊆_R$ Dummy. The symbol $⊆_R$ denotes the selection, by the Online Social Platform, of a random subset from Dummy.
Note, however, that knowing the set Dummy does not allow the finding of the set A, as Dummy and A are disjoint sets.
Moreover, since Dummy ∩ T = ∅, and thus |$A'$ ∩ $R_i'$| = |A ∩ $R_i$| + |$D_i$|, the size of the intersection A ∩ $R_i$ is masked with a random value |$D_i$|. Applying the homomorphic properties of the EGV cryptosystem, the Advertiser, by interacting with the Online Social Platform, will obtain an encryption of |A ∩ $R_i$|. Such a value will be encrypted under the Advertiser's public key.
Therefore, according to the EGV properties, the Advertiser will efficiently decrypt it if, and only if, either $R_i$ ⊆ A or A ∩ $R_i$ = ∅. In the protocol shown in Fig. 4, ($sk_A$, $pk_A$) denotes the Advertiser's private/public key pair.
Fig. 4. Enhanced private matching protocol.

The protocols in Figs. 3 and 4 follow the same pattern. Both are divided into three phases, of which the first two are almost identical; the only differences are the sets $A'$ and $R_i'$ used for the execution of the PSI-CA protocol. There are, however, a few observations to point out. This study resorts to dummy topics to hide the size of the intersection |A ∩ $R_i$|. Indeed, after running PSI-CA($A'$, $R_i'$), the Advertiser learns the value |$A'$ ∩ $R_i'$| = |A ∩ $R_i$| + |$D_i$|, where the size of A ∩ $R_i$ is blinded by the value |$D_i$|.
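The blinding identity |A′ ∩ R′| = |A ∩ R| + |D| is easy to check in a few lines (set contents and function names are illustrative):

```python
import random

def blind_inputs(A, R, dummy, rng):
    """Advertiser uses A' = A ∪ Dummy; the platform picks a random D ⊆ Dummy
    and uses R' = R ∪ D."""
    D = set(rng.sample(sorted(dummy), rng.randint(0, len(dummy))))
    return A | dummy, R | D, D

rng = random.Random(0)
A, R = {"ai", "sport"}, {"ai", "travel"}
dummy = {"dm1", "dm2", "dm3"}            # Dummy ∩ T = ∅: disjoint from real topics
A2, R2, D = blind_inputs(A, R, dummy, rng)
# The δ seen after PSI-CA is |A ∩ R| shifted by the random |D|
assert len(A2 & R2) == len(A & R) + len(D)
```

Because Dummy is disjoint from the real topics, every dummy in D lands in the intersection, so the shift is exactly |D|, which only the platform knows.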
In the third phase, by using the additive homomorphism of the EGV cryptosystem, the Online Social Platform (which randomly selected the set $D_i$) can un-blind |$A'$ ∩ $R_i'$| without learning the size of the intersection A ∩ $R_i$. The Online Social Platform can accomplish this task simply by computing

($α_i$ · $α_i'$, $β_i$ · $β_i'$) = Enc($pk_A$, |$A'$ ∩ $R_i'$|) · Enc($pk_A$, −|$D_i$|) = Enc($pk_A$, |A ∩ $R_i$|)    (16)

and sending this value to the Advertiser. Note, however, that this is not enough: the Advertiser can recover $g^{|A∩R_i|}$ by decrypting the received ciphertext and, since |A ∩ $R_i$| is a small value, the Advertiser can obtain |A ∩ $R_i$|. This event must be avoided. To fix such a leakage, the Online Social Platform masks |A ∩ $R_i$| with a random value $ρ_i$ $∈_R$ $Z_p^*$ by computing

(α, β) = (($α_i$ · $α_i'$)$^{ρ_i}$, ($β_i$ · $β_i'$)$^{ρ_i}$) = Enc($pk_A$, $ρ_i$ · |A ∩ $R_i$|).    (17)

Note that, during Phase III of the enhanced private matching protocol, the Online Social Platform computes (α, β) = (($α_i$ · $α_i'$)$^{ρ_i}$, ($β_i$ · $β_i'$)$^{ρ_i}$). This simple trick solves the problem as the Advertiser, by decrypting (α, β), obtains the following:

$γ_i$ = Dec($sk_A$, (α, β)) = $g^{ρ_i·|A∩R_i|}$ mod p.    (18)

Therefore, $γ_i$ = 1 if, and only if, A ∩ $R_i$ = ∅, and $γ_i$ = $g^{ρ_i·|R_i|}$ if, and only if, |A ∩ $R_i$| = |$R_i$| (i.e., $R_i$ ⊆ A). In the other cases, when A ∩ $R_i$ ≠ ∅ and |A ∩ $R_i$| ≠ |$R_i$|, due to the random nature of $ρ_i$, $γ_i$ is just a random value in $Z_p^*$. Hence, the Advertiser does not learn any information on |A ∩ $R_i$|. Note also that the discrete logarithm of $γ_i$ cannot be efficiently computed, as $γ_i$ is uniformly distributed in $Z_p^*$ and $ρ_i$ · |A ∩ $R_i$| is not necessarily a small value.
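The net effect of the un-blinding and masking steps can be simulated with a toy EGV instance; the small modulus and the direct use of the intersection size are illustrative simplifications (in the real protocol the two ciphertexts come from the Advertiser and the platform, respectively).

```python
import secrets

P, G = 2**61 - 1, 2        # toy Mersenne-prime modulus and base; not secure

def enc(pk, m):
    p, g, y = pk
    r = secrets.randbelow(p - 3) + 2
    return pow(g, r, p), (pow(y, r, p) * pow(g, m % (p - 1), p)) % p

def dec_mu(sk, p, c):
    a, b = c
    return (b * pow(a, p - 1 - sk, p)) % p       # μ = β · (α^κ)^{-1} = g^m

SK = secrets.randbelow(P - 3) + 2
PK = (P, G, pow(G, SK, P))

def platform_phase3(inter, d, rho):
    """Platform multiplies Enc(pk_A, |A'∩R'|) by Enc(pk_A, -|D|), then raises the
    result component-wise to a random ρ, hiding |A ∩ R| unless it is 0 or |R|."""
    a1, b1 = enc(PK, inter + d)                  # blinded size, from the Advertiser
    a2, b2 = enc(PK, -d)                         # platform's un-blinding factor
    return pow(a1 * a2 % P, rho, P), pow(b1 * b2 % P, rho, P)

rho = 7
# R ⊆ A with |R| = 2: the Advertiser sees γ = g^{ρ·|R|}
assert dec_mu(SK, P, platform_phase3(2, 3, rho)) == pow(G, rho * 2, P)
# A ∩ R = ∅: the Advertiser sees γ = 1
assert dec_mu(SK, P, platform_phase3(0, 5, rho)) == 1
```

Any intermediate intersection size yields γ = $g^{ρ·|A∩R|}$ for an unknown random ρ, which is indistinguishable from a random group element to the Advertiser.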

Outsourcing private matching
In this new scenario, to protect the topics' privacy, the dummy-topics trick is used again, as in the protocol described in Fig. 4. It is assumed that the Online Social Platform and the Advertiser share a key κ of a pseudorandom permutation F, while the service provider knows the Advertiser's public key $pk_A$ of the EGV cryptosystem. Starting from a common input, say an n-bit all-zeros string (i.e., $k_0$ = $0^n$), and using the pseudorandom permutation F, the Online Social Platform and the Advertiser derive the temporary keys $k_i$ = F(κ, $k_{i-1}$), for i = 1,...,$s_U$. The new protocol is quite similar to the one shown in Fig. 4. Its execution is divided into three phases, of which the first is identical to that of the protocol shown in Fig. 4. The second phase is substituted by $s_U$ interactions with the service provider (i.e., with $s_U$ invocations of an outsourced PSI-CA protocol). An outsourced PSI-CA protocol can be devised by considering the following simple steps. For i = 1,...,$s_U$, the Advertiser sends to the service provider the set $F_{A_i}$ = {F($k_i$, t) | t ∈ A ∪ Dummy}, while the Online Social Platform sends to the service provider the set $F_{R_i}$ = {F($k_i$, t) | t ∈ $R_i$ ∪ $D_i$}, for a randomly chosen subset $D_i$ ⊆ Dummy. To be precise, both parties send the elements of each set to the service provider in random order, i.e., they send a shuffling of the sets. This is necessary to prevent some information (e.g., the lexicographic order of the topics in the sets) from being inferred by the service provider from the order of the elements in the received set. Note that the pseudorandom permutation F is defined as F : {0,1}$^k$ × {0,1}$^n$ → {0,1}$^n$, while $k_i$ is n bits long. This is not an issue at all as, if n > k, then $k_i$ is truncated to its first k bits, while if n < k, then $k_i$ is padded with k − n zeroes.
The application of the pseudorandom permutation to the topics in A ∪ Dummy and in $R_i$ ∪ $D_i$ protects the topics' privacy. The service provider computes the value $δ_i$ = |$F_{A_i}$ ∩ $F_{R_i}$| = |A ∩ $R_i$| + |$D_i$| and sends Enc($pk_A$, $δ_i$) to the Online Social Platform. The third and last phase is almost identical to the one shown in Fig. 4, the only difference being that the Advertiser does not send the first message (i.e., Enc($pk_A$, $δ_i$)) to the Online Social Platform; rather, such a message is received by the Online Social Platform from the service provider.
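One outsourced round can be sketched as follows, with a keyed SHA-256 mapping standing in for the shared pseudorandom permutation F (an illustrative simplification; the set contents are hypothetical):

```python
import hashlib
import random

def F(key: bytes, topic: str) -> bytes:
    """Keyed-hash stand-in for the shared pseudorandom permutation F."""
    return hashlib.sha256(key + topic.encode()).digest()

def outsourced_round(key, A, dummy, R, rng):
    """Both parties ship shuffled images of their padded sets; the service
    provider only counts collisions: δ = |F_A ∩ F_R| = |A ∩ R| + |D|."""
    D = set(rng.sample(sorted(dummy), rng.randint(0, len(dummy))))
    FA = [F(key, t) for t in A | dummy]
    FR = [F(key, t) for t in R | D]
    rng.shuffle(FA)
    rng.shuffle(FR)                              # hide any ordering information
    delta = len(set(FA) & set(FR))               # service provider's computation
    return delta, len(D)

rng = random.Random(1)
delta, d = outsourced_round(b"shared-key", {"ai", "sport"}, {"dm1", "dm2"},
                            {"ai", "music"}, rng)
assert delta == 1 + d                            # |A ∩ R| = 1, blinded by |D|
```

The service provider never sees a topic in the clear, only its image under the shared key, and the count it returns is already blinded by the platform's dummies.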
Note that the protocol discussed in Section 5.2.4 is more efficient in terms of communication because there are fewer interactions between the Online Social Platform and the Advertiser. In addition, from a computational point of view, a large part of the computation is delegated to a (semi-trusted) third party. Consequently, the Online Social Platform and the Advertiser compute only evaluations of pseudorandom functions, whereas the protocol presented in Section 5.2.2 requires them to compute expensive public-key operations. On the other hand, the solution proposed in Section 5.2.2 does not require the existence of a semi-honest third party.

Experiments

This section describes the data set, measures, and results of the experiment conducted to assess the quality of the methodologies and techniques used to detect a sample of target users to whom the ads are to be shown.
A real data set covering a period of 6 months (January to June 2019), and comprising some 400,000 tweets posted by about 6,000 users, was acquired by using the Twitter Streaming API (https://dev.twitter.com/streaming/overview). In order to evaluate the proposed approach, 10 tweets were selected as promoted ads on such topics as sports and games, and the activity of each user, in terms of tweets/re-tweets and likes, was captured. Table 1 shows further details of the dataset statistics, which were collected in different time slots (i.e., in [midnight, noon] and [noon, midnight]).

Table 1. Data-set statistics: tweets collected in time slots [midnight, noon] and [noon, midnight]

                    Time slot 1    Time slot 2    Total
Re-tweets/replies   8,232          12,030         20,262
Tweets              198,323        234,432        426,755
Users               2,736          3,252          5,988
To evaluate the performance of the proposed approach in detecting a set of users interested in a specific advertisement, several measures of information retrieval systems were used: Precision, Recall, F-Measure, False Positive Rate, and Accuracy.
Given the set of all analyzed users U, an advertisement “A”, and a specific time slot t, let:

$U^*$ = {$u_1^*$, $u_2^*$,...,$u_m^*$} denote the set of users interested in “A” within the fixed time slot t. Such a set represents the gold set, whose elements have been handpicked by domain experts;

Ũ = {$ũ_1$, $ũ_2$,...,$ũ_n$} denote the set of users selected by the proposed methodology (i.e., the result set).

The measure Precision represents the fraction of retrieved users who are relevant. For the specific time slot t, it is defined as:

$P_t$ = |$U^*$ ∩ Ũ| / |Ũ|.    (19)

The measure Recall represents the fraction of relevant users who are actually retrieved. It shows how good the methodology is at detecting target users. For the specific time slot t, it is defined as:

(20)

The measure F-Measure computes how much (in terms of quality and quantity) the retrieved users are interested in the advertisement. For the specific time slot t, it is referred to as $F_t$ and defined as:

$F_t = \frac{2 \cdot P_t \cdot R_t}{P_t + R_t}$ (21)

The measure False Positive Rate (FPR) is the ratio of the number of uninterested users wrongly categorized as interested to the total number of uninterested users. For the specific time slot t, it is referred to as $FPR_t$ and defined as:

$FPR_t = \frac{|\tilde{U} \setminus U^*|}{|U \setminus U^*|}$ (22)

The measure Accuracy is the fraction of predictions that the model got right, i.e., the users correctly classified as interested or uninterested. For the specific time slot t, it is referred to as $A_t$ and defined as:

$A_t = \frac{|U^* \cap \tilde{U}| + |U \setminus (U^* \cup \tilde{U})|}{|U|}$ (23)
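The five measures above can be computed directly from the gold set and the result set with elementary set operations. The following sketch uses small hypothetical user sets purely for illustration:

```python
# Hypothetical sets for one time slot: U is all analyzed users,
# U_gold the expert-built gold set, U_res the methodology's result set.
U = {"u1", "u2", "u3", "u4", "u5", "u6"}
U_gold = {"u1", "u2", "u3"}          # users actually interested in ad "A"
U_res = {"u2", "u3", "u5"}           # users selected by the methodology

tp = len(U_gold & U_res)             # interested and selected
fp = len(U_res - U_gold)             # selected but not interested
fn = len(U_gold - U_res)             # interested but missed
tn = len(U - (U_gold | U_res))       # neither interested nor selected

precision = tp / len(U_res)                              # Eq. (19)
recall = tp / len(U_gold)                                # Eq. (20)
f_measure = 2 * precision * recall / (precision + recall)  # Eq. (21)
fpr = fp / (fp + tn)                                     # Eq. (22)
accuracy = (tp + tn) / len(U)                            # Eq. (23)

print(precision, recall, f_measure, fpr, accuracy)
```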

Fig. 5 shows the performance in terms of these measures for the two time slots (i.e., [midnight, noon] and [noon, midnight]) while varying the threshold ε ∈ [0.0, 1.0] (see Equation (1)).
Fig. 5. Evaluation results obtained by varying the threshold ε ∈ [0, 1] in the two time slots [midnight, noon] and [noon, midnight].

As shown in Fig. 5, the proposed approach achieves good performance with a threshold ε ∈ [0.6, 0.8], with acceptable values of Recall and Precision and a good Accuracy of ≈ 0.9. These metrics quantify how much, and with what degree of accuracy, the gold set is covered by the result set (of selected users) for a specific time slot, as the semantic threshold ε (see Equation (1)) varies.
Furthermore, the graph in Fig. 5 shows that better results are obtained in the second time slot [noon, midnight]. An analysis of the data indicates that the quality of the results improves when the intensity of posted tweets is greater than in the other time slot. In the second time slot [noon, midnight], a greater flow of tweets is available, allowing a better characterization of the users and, consequently, a more effective match. Moreover, comparing the Recall curve to the FPR curve highlights the proportion of uninterested users that end up being targeted. As can be seen, across the range of thresholds the proposed approach almost always yields an acceptably low false positive rate.
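The threshold sweep behind Fig. 5 can be sketched as follows. Equation (1)'s relevance score is not reproduced here, so the per-user scores below are hypothetical placeholders; the sketch only illustrates how raising ε trades Recall against FPR.

```python
# Hypothetical per-user relevance scores for one ad in one time slot
# (the actual score comes from Equation (1), not reproduced here).
scores = {"u1": 0.9, "u2": 0.75, "u3": 0.55, "u4": 0.4, "u5": 0.8, "u6": 0.2}
gold = {"u1", "u2", "u3"}            # expert-labelled interested users

for eps in (0.2, 0.4, 0.6, 0.8):
    # A user is selected when their relevance score reaches the threshold.
    selected = {u for u, s in scores.items() if s >= eps}
    tp = len(gold & selected)
    fp = len(selected - gold)
    tn = len(scores.keys() - gold - selected)
    recall = tp / len(gold)
    fpr = fp / (fp + tn)
    print(f"eps={eps}: recall={recall:.2f}, fpr={fpr:.2f}")
```

As ε grows, fewer users clear the threshold, so both Recall and FPR fall; the sweet spot reported in the paper (ε ∈ [0.6, 0.8]) keeps Recall acceptable while holding FPR low.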

Conclusion

This paper presents a methodology for profiling social network users in a temporal window (time slot) in order to target them with the advertisements most likely to be of interest to them. Since evident security risks exist, a privacy layer was designed to prevent leakage of the sensitive information of all the players concerned, including the users and advertisers. Considering this issue, different future directions may be taken. One may prefer to use completely different linguistic tools than “Wikify!” to extract topics from users’ tweet streams. Moreover, a comparison of the proposed approach based on rough set theory with alternative mathematical tools, like fuzzy sets [38], could also offer helpful insights. This paper does not consider real-time bidding (RTB). It would be interesting to analyze whether our techniques could be profitably used in RTB to make this process privacy friendly without resorting to client-side agents (i.e., proxies, browser extensions/plug-ins), as is the case in state-of-the-art works. Furthermore, in the future, the authors plan to thoroughly compare the enhanced private matching protocol with the outsourced protocol, and to implement both of them so as to evaluate their performance in terms of running time and communication complexity (i.e., the size of the messages exchanged by the Advertiser, the Online Social Platform, and, in the case of the outsourced version, the service provider). The authors will also assess whether there is an optimal number of topics per ad and per user for minimizing the complexity of the protocol while maximizing the effectiveness of the advertisement campaign.

Acknowledgements

Not applicable.

Author’s Contributions

All the authors have participated in the research work and contributed to this manuscript. All the authors have read and approved the final manuscript.

Funding

None.

Competing Interests

The authors declare that they have no competing interests.

Availability of Data and Material

A real data set was acquired using the Twitter Streaming API.

Author Biography

Name : Carmen De Maio
Affiliation : Dipartimento di Ingegneria dell’Informazione ed Elettrica e Matematica Applicata, Università degli Studi di Salerno, 84084 Fisciano, Italy
Biography : Assistant Professor; her research interests include Knowledge Extraction and Social Media Analytics.

Name : Mimmo Parente
Affiliation : Dipartimento di Scienze Aziendali-Management and Innovation Systems (DISA-MIS), Università di Salerno, 84084 Fisciano, Italy
Biography : Professor of Computer Science; his research interests include Automatic Verification of Systems and Text Mining.

Name : Carlo Blundo
Affiliation : Dipartimento di Scienze Aziendali-Management and Innovation Systems (DISA-MIS), Università di Salerno, 84084 Fisciano, Italy
Biography : Professor of computer science, his main research interests include cryptography and data security.

Name : Luisa Siniscalchi
Affiliation : Concordium Blockchain Research Center, Aarhus University, Aarhus, Denmark
Biography : Postdoctoral Researcher at Aarhus University; her research is focused on cryptography.

References

[1] J. Estrada-Jimenez, J. Parra-Arnau, A. Rodriguez-Hoyos, and J. Forne, “Online advertising: analysis of privacy threats and protection approaches,” Computer Communications, vol. 100, pp. 32-51, 2017.
[2] M. R. Solomon and T. L. Tuten, Social Media Marketing. Upper Saddle River, NJ: Pearson Education, 2012.
[3] C. W. Ho and Y. B. Wang, “Re-purchase intentions and virtual customer relationships on social media brand community,” Human-centric Computing and Information Sciences, vol. 5, no. 1, article no. 18, 2015. https://doi.org/10.1186/s13673-015-0038-x
[4] J. Deighton, L. Kornfeld, and M. Gerra, “Economic value of the advertising-supported internet ecosystem,” 2017 [Online]. Available: https://www.iab.com/wp-content/uploads/2017/03/Economic-Value-Study-2017-FINAL2.pdf.
[5] C. De Maio, G. Fenza, M. Gallo, V. Loia, and S. Senatore, “Formal and relational concept analysis for fuzzy-based automatic semantic annotation,” Applied Intelligence, vol. 40, no. 1, pp. 154-177, 2013.
[6] C. De Maio, G. Fenza, V. Loia, and M. Parente, “Time aware knowledge extraction for microblog summarization on Twitter,” Information Fusion, vol. 28, pp. 60-74, 2016.
[7] S. Boffa, C. D. Maio, B. Gerla, and M. Parente, “Context-aware advertisement recommendation on Twitter through rough sets,” in Proceedings of 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Rio de Janeiro, Brazil, 2018, pp. 1-8.
[8] Z. Pawlak, “Rough sets,” International Journal of Computer & Information Sciences, vol. 11, no. 5, pp. 341-356, 1982.
[9] B. Walczak and D. L. Massart, “Rough sets theory,” Chemometrics and Intelligent Laboratory Systems, vol. 47, no. 1, pp. 1-16, 1999.
[10] J. Jarvinen, “Knowledge representation and rough sets,” 1999 [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.8637&rep=rep1&type=pdf.
[11] A. Skowron and S. Dutta, “Rough sets: past, present, and future,” Natural Computing, vol. 17, no. 4, pp. 855-876, 2018.
[12] C. Blundo, C. D. Maio, M. Parente, and L. Siniscalchi, “An intelligent and private method to profile social network users,” in Proceedings of 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), New Orleans, LA, 2019, pp. 1-6.
[13] J. Estrada-Jiménez, J. Parra-Arnau, A. Rodriguez-Hoyos, and J. Forne, “Online advertising: analysis of privacy threats and protection approaches,” Computer Communications, vol. 100, pp. 32-51, 2017.
[14] A. Juels, “Targeted advertising ... and privacy too,” in Topics in Cryptology - CT-RSA 2001. Berlin, Germany: Springer, 2001, pp. 408-424.
[15] V. Toubiana, A. Narayanan, D. Boneh, H. Nissenbaum, and S. Barocas, “Adnostic: privacy preserving targeted advertising,” in Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2010, pp. 1–16.
[16] D. Sanchez and A. Viejo, “Privacy-preserving and advertising-friendly web surfing,” Computer Communications, vol. 130, pp. 113-123, 2018.
[17] J. Parra-Arnau, J. P. Achara, and C. Castelluccia, “MyAdChoices: bringing transparency and control to online advertising,” ACM Transactions on the Web, vol. 11, no. 1, article no. 7, 2017. https://doi.org/10.1145/2996466
[18] E. Androulaki and S. M. Bellovin, “A secure and privacy-preserving targeted ad-system,” in Financial Cryptography and Data Security. Berlin, Germany: Springer, 2010, pp. 123-135.
[19] S. Guha, B. Cheng, and P. Francis, “Privad: practical privacy in online advertising,” in Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, Boston, MA, 2011, pp. 169-182
[20] A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets,” in Proceedings of 2008 IEEE Symposium on Security and Privacy (sp2008), Oakland, CA, 2008, pp. 111-125.
[21] S. T. Boshrooyeh, A. Küpçü, and O. Ozkasap, “PPAD: Privacy Preserving Group-Based ADvertising in online social networks,” in Proceedings of 2018 IFIP Networking Conference (IFIP Networking) and Workshops, Zurich, Switzerland, 2018, pp. 1-9.
[22] T. Boshrooyeh, A. Kupçu, and O. Ozkasap, “Privado: privacy-preserving group-based advertising using multiple independent social network providers,” ACM Transactions on Privacy and Security, vol. 23, no. 3, article no. 12, 2020. https://doi.org/10.1145/3386154
[23] J. Estrada-Jimenez, J. Parra-Arnau, A. Rodriguez-Hoyos, and J. Forne, “On the regulation of personal data distribution in online advertising platforms,” Engineering Applications of Artificial Intelligence, vol. 82, pp. 13-29, 2019.
[24] J. Jiang, Y. Zheng, Z. Shi, X. Yuan, X. Gui, and C. Wang, “A practical system for privacy-aware targeted mobile advertising services,” IEEE Transactions on Services Computing, vol. 13, no. 3, pp. 410-424, 2020.
[25] E. De Cristofaro, P. Gasti, and G. Tsudik, “Fast and private computation of cardinality of set intersection and union,” in Cryptology and Network Security. Heidelberg, Germany: Springer-Verlag, 2012, pp. 218-231.
[26] C. Blundo, E. D. Cristofaro, and P. Gasti, “EsPRESSO: efficient privacy-preserving evaluation of sample set similarity,” Journal of Computer Security, vol. 22, no. 3, pp. 355-381, 2014.
[27] E. De Cristofaro and G. Tsudik, “Experimenting with Fast Private Set Intersection,” in Trust and Trustworthy Computing. Heidelberg, Germany: Springer-Verlag, 2012, pp. 55-73.
[28] M. Ciampi and C. Orlandi, “Combining private set-intersection with secure two-party computation,” in Security and Cryptography for Networks. Cham, Switzerland: Springer Nature, 2018, pp. 464-482.
[29] T. Elgamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” IEEE Transactions on Information Theory, vol. 31, no. 4, pp. 469-472, 1985.
[30] R. Mihalcea and A. Csomai, “Wikify!: linking documents to encyclopedic knowledge,” in Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon, Portugal, 2007, pp. 233-242.
[31] M. Ciampi, R. Ostrovsky, L. Siniscalchi, and I. Visconti, “Round-optimal secure two-party computation from trapdoor permutations,” in Theory of Cryptography. Cham, Switzerland: Springer Nature, 2017, pp. 678-710.
[32] M. Ciampi, R. Ostrovsky, L. Siniscalchi, and I. Visconti, “Delayed-input non-malleable zero knowledge and multi-party coin tossing in four rounds,” in Theory of Cryptography. Cham, Switzerland: Springer Nature, 2017, pp. 711-742.
[33] M. Ion, B. Kreuter, E. Nergiz, S. Patel, S. Saxena, K. Seth, D. Shanahan, and M. Yung, “Private intersection-sum protocol with applications to attributing aggregate ad conversions,” 2017 [Online]. Available: https://eprint.iacr.org/2017/738.
[34] M. Ion, B. Kreuter, A. E. Nergiz, S. Patel, M. Raykova, S. Saxena, K. Seth, D. Shanahan, and M. Yung, “On deploying secure computing: private intersection-sum-with-cardinality,” 2019 [Online]. Available: https://eprint.iacr.org/2019/723.
[35] D. Liu, J. Ni, H. Li, X. Lin, and X. Shen, “Efficient and privacy-preserving ad conversion for v2x-assisted proximity marketing,” in Proceedings of 2018 IEEE 15th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), Chengdu, China, 2018, pp. 10-18.
[37] S. Yuan, A. Z. Abidin, M. Sloan, and J. Wang, “Internet advertising: an interplay among advertisers, online publishers, ad exchanges and web users,” 2012 [Online]. Available: https://arxiv.org/abs/1206.1754.
[38] Y. Yang and C. Hinde, “A new extension of fuzzy sets using rough sets: R-fuzzy sets,” Information Sciences, vol. 180, no. 3, pp. 354–365, 2010.