KM-DBSCAN: Density-Based Clustering of Massive Spatial Data with Keywords
Hong-Jun Jang1 and Byoungwook Kim2,*

Human-centric Computing and Information Sciences volume 11, Article number: 43 (2021)
https://doi.org/10.22967/HCIS.2021.11.043

Abstract

Density-based spatial clustering is one of the most famous algorithms for spatial clustering. Most clustering studies regarding the density-based spatial clustering of applications with noise (DBSCAN) have focused on efficiently finding groups for spatial objects with numeric data or categorical data. In this paper, we propose a novel density-based spatial clustering algorithm called keyword-matched DBSCAN (KM-DBSCAN), which deals with textual data as well as numerical or categorical data. KM-DBSCAN utilizes an IR2-tree as an index structure. To find keyword-matched objects in large datasets, it performs a nearest neighbor search using a branch-and-bound method. While traversing the IR2-tree, KM-DBSCAN effectively prunes unqualified nodes by utilizing both spatial and textual information in nodes. To demonstrate the efficiency of KM-DBSCAN, we conducted extensive experiments in various settings. The experimental results show that KM-DBSCAN is very efficient in terms of computational cost.

Keywords

Spatial Clustering, DBSCAN, Textual Information, IR2-Tree

Introduction

Spatial clustering, which groups a set of spatial objects into groups called clusters, is one of the key technologies for spatial data mining and spatial data analysis [1]. Spatial clustering ensures that spatial objects in the same cluster are similar to each other and dissimilar from those in other clusters. Recently, with the development of the geographic information system, there is increasing interest in, and research on, the development of new application services such as location-based services and telematics providing a route search, surrounding information searches, emergency services, etc. Spatial clustering is also widely used in the field of spatial data analysis such as land-use detection [2], crime hot-spot analysis [3], cartographic generalization [4], geographic customer segmentation [5], and earthquake analysis [6]. As the technology of finding useful knowledge in spatial data becomes more important, a variety of research on finding groups has been conducted, such as nearest base-neighbor searches on spatial datasets [7], nearest neighborhood searches in spatial databases [8], nearest group queries [9], and the grid-based K-prototypes algorithm [10].
Density-based spatial clustering of applications with noise (DBSCAN) is one of the most famous algorithms for spatial clustering [11–13]. DBSCAN requires two input parameters, ɛ and minPtr. The neighborhood within radius ɛ of a given object is called the ɛ-neighborhood of the object, and an object with at least minPtr objects within its ɛ-neighborhood is called a core object. DBSCAN can find clusters of variable sizes and shapes, and it also detects noise. Most clustering studies on DBSCAN have focused on efficiently finding groups for spatial objects with numeric or categorical data. However, many real-world spatial objects have textual information, not just numeric or categorical data. Hence, if the textual information affects spatial clustering results, the error value can increase when the final cluster results are evaluated. Table 1 shows an example of the data we are going to deal with. The spatial data have location information (latitude, longitude) for each shop, and textual information about the shop. The textual information in the spatial data can be any of a variety of text descriptions obtained from newspaper articles, personal blogs, and social networking services. For simplicity, we assume that a textual description is a set of keywords such as tags [14].

Table 1. An example of spatial data with keywords

ID Type Latitude Longitude Textual information
1 bakery 14 6 coffee, sandwich, milk
2 restaurant 12 3 steak, pasta
3 bakery 30 4 cookie, ice cream
4 book store 31 9 book, snack
5 bakery 59 9 coffee, ice cream
6 bakery 12 8 coffee, sandwich, milk
7 restaurant 12 3 sandwich, steak, pasta
8 clothing shop 30 4 suits, necktie
9 book store 31 9 book, milk
10 bakery 59 9 cookie, ice cream

The data in Table 1 consist of the type of each shop, the spatial data giving the shop's location, and the menu items each shop sells. In this case, we use the items each shop sells as keywords. Such data can be used in market area analysis, e.g., selecting the most effective advertising locations or a site for a new shop. For example, suppose a user wants to open a sandwich shop in an area where there are many other shops. To avoid overheated competition, it is reasonable for the user to choose an area with fewer sandwich shops. Moreover, not only should the area have no other sandwich shops, there should also be many shops in the area that are often frequented by potential customers. Therefore, the user prefers large, high-density areas for high sales, with no other sandwich shops in the area. Fig. 1 shows an example of density-based clustering on spatial data using keywords. Considering the location data only, the areas where shops are concentrated are a1, a2, a3, and a4. Using the items shops have for sale (provided as text information), the areas where shops are crowded and where no shops sell sandwiches are a2 and a4. As such, if a concentrated area is searched for using keywords, a detailed area suited to the user's intention can be found.
Intuitively, this problem can be solved using a combination of a keyword search algorithm and the original DBSCAN algorithm, referred to as the naïve keyword-based DBSCAN (NK-DBSCAN) algorithm. This combined approach executes the original DBSCAN algorithm using only spatial data, and within the RangeQuery of the DBSCAN algorithm it determines the objects that contain the query keywords.
As a straightforward approach, we introduce an inverted-index-based keyword DBSCAN (IN-DBSCAN) which combines an inverted index [15] and the original DBSCAN algorithm. An inverted index is an index structure that is used efficiently when searching for a specific word in a large database. The inverted index indexes keywords so that it can quickly retrieve information from a tuple that has a specific keyword.

Fig. 1. An example of keyword-matched DBSCAN.

In order to deal with textual information using DBSCAN effectively, we propose an efficient algorithm, referred to as a keyword-matched DBSCAN (KM-DBSCAN) algorithm. The proposed algorithm is based on an IR2-tree [16] as an index structure. To retrieve a keyword-matched object, it performs a nearest neighbor (NN) search [17] in a branch-and-bound approach [18]. While traversing an IR2-tree, KM-DBSCAN effectively prunes unqualified nodes by means of both spatial and textual information in the nodes.
To demonstrate performance and scalability, we conducted extensive experiments on real and synthetic datasets. The experimental results show that KM-DBSCAN is efficient in terms of CPU time costs.
The contributions of this paper are summarized as follows.

We introduced novel density-based clustering algorithms that deal with textual information as well as numerical or categorical data.

We developed three DBSCAN algorithms to deal with textual information: NK-DBSCAN, IN-DBSCAN, and KM-DBSCAN algorithms based on an IR2-tree.

We conducted extensive experiments on both real and synthetic datasets to show the effectiveness of KM-DBSCAN compared to NK-DBSCAN and IN-DBSCAN.

The rest of this paper is organized as follows. Section 2 reviews works related to spatial clustering algorithms. In Section 3, we define the problem addressed by our KM-DBSCAN. In Section 4, we propose the three algorithms. In Section 5, we present experimental results and their evaluation. In Section 6, we conclude our work and present some directions for future research.

Related Work

Density-based Clustering Algorithms
In this section, we give a brief overview of density-based clustering. DBSCAN is a density-based clustering algorithm [6] that can be used to identify clusters of arbitrary shapes in a data set that includes noise and outliers. The basic approach of density-based clustering is derived from an intuitive human way of recognizing clusters: clusters are areas of high point density in the data space, separated by areas of low point density. The DBSCAN algorithm formalizes these intuitive notions of clusters and noise. The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
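This key idea can be expressed directly in code. The sketch below is our own minimal illustration (not the paper's implementation), using a linear scan over 2D points to test the core-point condition:

```python
import math

def eps_neighborhood(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (itself included)."""
    px, py = points[idx]
    return [i for i, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]

def is_core(points, idx, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, idx, eps)) >= min_pts
```

A spatial index such as an R-tree replaces the linear scan in practice, but the density criterion itself is this simple counting test.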
Shi et al. [19] designed a density-based clustering algorithm for places in geo-social networks to detect geo-social clusters. Their algorithm extends DBSCAN by replacing the Euclidean distance threshold for the extent of dense regions with a threshold ɛ that considers both the spatial and the social distances between places. Chen et al. [20] proposed a novel fast density clustering algorithm (FDCA) for mixed data which determines cluster centers automatically. Wu et al. [14] extended traditional density-based clustering of spatial locations to consider their relationship to a social network of people who visit them and the times when the locations were visited. Song and Lee [21] developed a novel parallel DBSCAN algorithm, random partitioning-DBSCAN (RP-DBSCAN), which uses pseudo-random partitioning together with a two-level cell dictionary. RP-DBSCAN simultaneously finds the local clusters for each data partition and then merges these local clusters to obtain a global clustering. Up to now, most research on density-based clustering has focused on either numeric or categorical data rather than textual data.
Zhang et al. [22] proposed a powerful cyclist detection and tracking method based on a multi-layer laser scanner that obtains a four-layer point cloud of the local environment; DBSCAN was utilized to segment the laser points into individual clusters for subregions. Hayat et al. [23] proposed a user prediction model for dual channels (text and call), in which DBSCAN is used to form clusters of communication events. Liao [24] proposed a hotspot analysis method based on spatial clustering of trajectory stops; the DBSCAN algorithm, which can cluster arbitrary shapes in space with fast clustering speed and noise handling, was utilized in that study.

Keyword-Matching
Typical text-based data retrieval methods include inverted indexes and signature files. Inverted indexes have been widely used as an index structure for text searches. An inverted index consists of several inverted lists. The inverted list for a word contains the IDs of all the documents that contain the word, together with the word's frequency in each. When a query is given, the system retrieves the inverted list for each word in the query and calculates the score of each document based on the frequency of the query words in the documents containing them.
The signature file is based on an inexact filter method that speeds up retrieval by first removing dissimilar documents. It generates a keyword signature for each keyword using a hashing function and generates a document signature through superimposed coding. In the search phase, a signature is generated for the search term, and candidate documents are found by applying the AND operation to the search-term signature and each document signature. When the search term does not actually exist in a candidate document, it is called a false drop. The longer the signature, the lower the false-drop probability.
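Superimposed coding can be illustrated with a toy example. In the sketch below (our own construction; the signature length, bits-per-keyword, and MD5-based bit selection are illustrative assumptions, not taken from the cited work), a document signature is the OR of its keyword signatures, and the AND test is the inexact filter:

```python
import hashlib

L = 64            # signature length in bits (assumed)
BITS_PER_KEY = 3  # bits set per keyword (assumed)

def keyword_signature(word: str) -> int:
    """Set BITS_PER_KEY pseudo-random bit positions for one keyword."""
    sig = 0
    for i in range(BITS_PER_KEY):
        digest = hashlib.md5(f"{word}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % L)
    return sig

def doc_signature(words) -> int:
    """Superimposed coding: OR all keyword signatures together."""
    sig = 0
    for w in words:
        sig |= keyword_signature(w)
    return sig

def may_contain(doc_sig: int, query_sig: int) -> bool:
    """Inexact filter: a negative result rules the document out;
    a positive result may still be a false drop."""
    return (doc_sig & query_sig) == query_sig
```

A document that actually contains the keyword always passes the test; the false drops arise only in the other direction, which is why a longer signature (fewer accidental bit overlaps) reduces them.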
Recently, as a way of searching for keywords in space, research on mixed index structures combining a text-based data retrieval method with a spatial data indexing method has been actively conducted. De Felipe et al. [16] proposed an index structure that integrates a signature file and an R-tree to enable keyword retrieval for spatial data objects with a limited number of keywords. Cong et al. [25] proposed a hybrid index structure integrating an R-tree and an inverted index. Yao et al. [26] proposed a modified version of the R-tree called the MHR-tree, which uses min-wise signatures and a linear hashing technique. Rocha-Junior et al. [27] proposed a spatial inverted index that is a mixture of an R-tree and an inverted index. The IR2-tree is a modification of the R-tree and has two types of index information: a minimum bounding rectangle (MBR) for spatial data and a signature for text data. More specifically, a node at the lowest level (i.e., a leaf node) of the IR2-tree includes both the MBR and the term signature that tightly surrounds the object. On the other hand, an ith-level node (i.e., a non-leaf node) stores an MBR enclosing the MBRs of its (i−1)th-level children and a signature superimposing their (i−1)th-level signatures; that is, each ith-level node has a signature that contains all the signatures of the subtree rooted at that node. Based on these two kinds of index information, the IR2-tree facilitates the filtering of unqualified nodes in terms of both spatial location and textual information. An IR2-tree has been used to solve the keyword-matching problem in the skyline problem [28]. We were inspired by this solution and applied the IR2-tree to DBSCAN.
In addition to general text data, text information included in an image may be utilized. Bu et al. [29] proposed a content-based image search system using a combination of the completed local binary pattern (CLBP) and color autocorrelogram.

Problem Definition

In this section, we define the keyword-matched DBSCAN. Let $𝒟$ be a dataset consisting of tuples. A tuple consists of $(V, W)$, where $V$ is a spatial data vector, $W=\{w_1, w_2, ..., w_t\}$ is a set of strings, and $w_j$ $(j=1,...,t)$ is a keyword. In addition, $tp_i.V$ and $tp_i.W$ represent the spatial data vector and the set of keywords of tuple $tp_i$, respectively. Table 2 summarizes the notations used throughout this study.

Table 2. Notations

Notation Description
𝒟 the entire dataset
V spatial data vector in a tuple
W a set of keywords
t cardinality of a set of keywords
Q a set of query keywords
ɛ the radius of a neighborhood
minPtr the minimum number of neighbor tuples
Definition 1 (keyword-matched tuple). Let $Q=\{q_1, q_2, …, q_m\}$ be a set of query keywords. A tuple $tp$ in $𝒟$ is a keyword-matched tuple for $Q$ if and only if $∀q∈Q, q∈tp.W$.

Definition 2 (tupledistance). Given two tuples $tp_i$ and $tp_j$, the distance between $tp_i$ and $tp_j$ is defined as

$dist(tp_i,tp_j )=\sqrt{\displaystyle\sum_{k=1}^{r}(tp_i.v_k-tp_j.v_k)^2},$(1)

where $r$ is the dimension of the spatial data vector. In this study, a tuple consists of a spatial data vector and a set of keywords. We define the distance between two tuples as the Euclidean distance. Thus, the distance between two tuples is measured using only the spatial data vector.

Definition 3 (ɛ-neighborhood of a tuple). An ɛ-neighborhood of a tuple $tp$, denoted by $N_ɛ(tp)$, is defined by $\{tp'∈𝒟 | dist(tp,tp') ≤ ɛ\}$.

Definition 4 (keyword-matched ɛ-neighborhood of a tuple). Let $Q$ be a set of query keywords and $R(Q)$ be the set of keyword-matched tuples in $𝒟$ for $Q$. A keyword-matched $ɛ$-neighborhood of a tuple, denoted by $KN_ɛ(tp, Q)$, is defined by $KN_ɛ(tp, Q)=\{tp'∈R(Q) | dist(tp,tp') ≤ ɛ\}$.

The keyword-matched ɛ-neighborhood of a tuple is a form in which keyword matching is added to the condition of the ɛ-neighborhood of a tuple. Tuple tp is a core tuple if |$KN_ɛ(tp, Q)$| is more than the minimum number of neighbor points, $minPtr$. Tuple tp is a border tuple if $KN_ɛ(tp, Q)$ has at least one core tuple and $|KN_ɛ(tp, Q)|<minPtr$.
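Under these definitions, a keyword-matched ɛ-neighborhood can be computed naively with a linear scan. The sketch below is our own illustration (the dictionary layout of a tuple, with keys "V" and "W", is an assumption for readability):

```python
import math

def keyword_matched_neighbors(tuples, tp, eps, query):
    """KN_eps(tp, Q): tuples within eps of tp whose keyword set W contains every q in Q."""
    q = set(query)
    return [other for other in tuples
            if q <= set(other["W"])                      # keyword-matched (AND) condition
            and math.dist(tp["V"], other["V"]) <= eps]   # Euclidean distance on V
```

For instance, with the bakery tuples of Table 1 and Q = {sandwich}, only the shops that sell sandwiches can contribute to the neighborhood count, regardless of how close the other shops are.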

Definition 5 (directly density-reachable). A tuple $tp_i$ is directly density-reachable from a tuple $tp_j$ with regard to ɛ and minPtr if 1) $tp_i∈KN_ɛ(tp_j, Q)$ and 2) $minPtr ≤ |KN_ɛ(tp_j, Q)|$ (core tuple condition).

Definition 6 (density-reachable). A tuple $tp_i$ is density-reachable from a tuple $tp_j$ wrt. ɛ and $minPts$ if there is a chain of tuples $tp_1, ..., tp_n$, $tp_1 = tp_j$, $tp_n = tp_i$, such that $tp_{k+1}$ is directly density-reachable from $tp_k$.

Definition 7 (density-connected). A tuple $tp_i$ is density-connected to a tuple $tp_j$ wrt. $ɛ$ and $minPts$ if there is a tuple $tp_o$ such that both $tp_i$ and $tp_j$ are density-reachable from $tp_o$ wrt. ɛ and minPts.

Definition 8 (cluster). Let $𝒟$ be a database of tuples. A cluster $C$ wrt. $ɛ$ and $minPts$ is a non-empty subset of $𝒟$ satisfying the following conditions: 1) $∀tp_i, tp_j$: if $tp_i∈C$ and $tp_j$ is density-reachable from $tp_i$ wrt. $ɛ$ and $minPts$, then $tp_j∈C$ (maximality); 2) $∀tp_i, tp_j∈C$: $tp_i$ is density-connected to $tp_j$ wrt. $ɛ$ and $minPts$ (connectivity).

Definition 9 (noise). Let $C_1,...,C_k$ be the clusters of database $𝒟$ wrt. $ɛ$ and $minPts$. Then we define the noise as the set of tuples in database $𝒟$ not belonging to any cluster $C_i$ $(i=1,...,k)$, i.e., noise = $\{tp∈𝒟 | ∀i: tp∉C_i\}$.

Proposed Algorithms

As a straightforward approach, we present the NK-DBSCAN, which extends the original DBSCAN algorithm to deal with keywords. Its key idea is to consider only keyword-matched tuples in the DBSCAN algorithm. Algorithm 1 presents the NK-DBSCAN.

Algorithm 1. Naive Keyword-based DBSCAN (NK-DBSCAN)
Input: A set 𝒟 of N tuples, ɛ, minPts, a set Q of query keywords
Output: A set 𝒟 of N labeled tuples
1. cluster id $C ← 0$
2. for each unclassified tuple $tp∊𝒟$ do
3. if $tp.contain(Q)$ then
4. $KN_ɛ(tp) ← RangeQuery(tp, ɛ, Q)$
5. if $|KN_ɛ(tp)| ≥ minPts$ then
6. $tp.id ← C$
7. $ExpandCluster(tp, KN_ɛ(tp), C, ɛ, minPts, Q)$
8.     $C ← C + 1$
9. else
10.     Label $tp$ as noise
11. else
12.     Label $tp$ as noise
13. return $𝒟$
For each tuple not yet classified in any cluster, the algorithm determines whether the tuple has the query keywords $Q$ (line 3). When the number of keywords in $Q$ is two or more, contain can be interpreted in two ways. The first is that all elements of $Q$ are included in $tp.W$: $∀q∈Q, q∈tp.W$. This function is equivalent to the AND operation in the Boolean query model [30]. The second is that $tp.W$ has at least one element of $Q$: $∃q∈Q, q∈tp.W$. This function is equivalent to the OR operation in the Boolean query model. Thus, contain can have two interpretations, and we performed experiments with the AND operation. If $tp.W$ contains $Q$, the $RangeQuery$ function is executed for the tuple; otherwise, the tuple is labeled as noise (line 4). The main difference between NK-DBSCAN and the original DBSCAN is the $RangeQuery$ function, which finds the keyword-matched ɛ-neighborhood of a tuple. Fig. 2 shows an example of the $RangeQuery$ function wrt. $minPtr = 2$. Given $Q$ = {sandwich}, assume that $tp_1$ is selected first from among all tuples. Since $tp_1.W$ contains “sandwich”, $RangeQuery(tp_1, ɛ, Q)$ is executed. The $RangeQuery$ function first obtains $N_ɛ(tp_1) = \{tp_2, tp_9, tp_{10}, tp_{11}, tp_{12}, tp_{13}\}$. Because all tuples can be indexed using an R-tree for spatial data vectors, we can efficiently obtain the ɛ-neighborhood of a tuple. Second, the algorithm checks whether each tuple contains “sandwich.” Consequently, only $tp_2$ is stored in $KN_ɛ(tp_1)$. Since the number of tuples finally returned from the $RangeQuery$ function is at least $minPtr = 2$ (line 5), we assign a new cluster ID to $tp_1$ (line 6) and expand the cluster based on $tp_1$ (line 7) by calling the ExpandCluster function. Thus, $tp_1$ is used as a seed tuple to create a new cluster and extend the cluster.
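A runnable sketch of Algorithm 1, with the cluster-expansion loop of ExpandCluster inlined, might look as follows. The dictionary layout of tuples and the linear-scan RangeQuery (standing in for the R-tree) are our own simplifications:

```python
import math

def nk_dbscan(tuples, eps, min_pts, query):
    """NK-DBSCAN sketch: DBSCAN restricted to keyword-matched tuples.
    Each tuple is a dict with 'V' (coordinates) and 'W' (keywords).
    Returns a mapping id(tuple) -> cluster id or 'noise'."""
    q = set(query)

    def range_query(tp):
        # RangeQuery(tp, eps, Q): keyword-matched eps-neighborhood (linear scan)
        return [o for o in tuples
                if q <= set(o["W"]) and math.dist(tp["V"], o["V"]) <= eps]

    labels, cid = {}, 0
    for tp in tuples:
        if id(tp) in labels:
            continue
        if not q <= set(tp["W"]):
            labels[id(tp)] = "noise"      # tuple lacks a query keyword
            continue
        neigh = range_query(tp)
        if len(neigh) < min_pts:
            labels[id(tp)] = "noise"      # not a core tuple
            continue
        labels[id(tp)] = cid              # seed a new cluster
        seeds = list(neigh)
        while seeds:                      # ExpandCluster (Algorithm 2)
            ntp = seeds.pop()
            lbl = labels.get(id(ntp))
            if lbl == "noise":
                labels[id(ntp)] = cid     # former noise becomes a border tuple
            if lbl is not None:
                continue
            labels[id(ntp)] = cid
            more = range_query(ntp)
            if len(more) >= min_pts:      # ntp is also a core tuple
                seeds.extend(more)
        cid += 1
    return labels
```

Note that a tuple without the query keywords is labeled noise immediately, exactly as in lines 11–12 of Algorithm 1, so it can never become a border tuple of any cluster.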

Fig. 2. Range query in keyword-based DBSCAN.

Algorithm 2 presents the $ExpandCluster$ function, which is similar to the $ExpandCluster$ function in the original DBSCAN except for $RangeQuery$. When the $ExpandCluster$ function is first called, $KN_ɛ(tp)$ contains only the keyword-matched ɛ-neighborhood of the seed tuple. The $RangeQuery$ function is executed for each tuple contained in $KN_ɛ(tp)$, and it returns a new keyword-matched ɛ-neighborhood, $KN_ɛ(ntp)$ (line 4). As $KN_ɛ(ntp)$ is added to $KN_ɛ(tp)$, $KN_ɛ(tp)$ is extended, and the cluster is extended as well (line 6). When there is no new tuple in $KN_ɛ(tp)$, the expansion of the cluster is terminated.
Algorithm 2. ExpandCluster
Input: $tp$, a set $KN_ɛ(tp)$ of keyword-based neighbor tuples of $tp$, cluster id $C$, $ɛ$, minPts, a set $Q$ of query keywords
1. $tp.id ← C$
2. for each tuple $ntp∊KN_ɛ(tp)$ do
3. if $ntp$ is unclassified then
4. $KN_ɛ(ntp) ← RangeQuery(ntp, ɛ, Q)$
5. if $|KN_ɛ(ntp)| ≥ minPts$ then
6. $KN_ɛ(tp) ← KN_ɛ(tp) ∪ KN_ɛ(ntp)$
7. if $ntp$ does not belong to any cluster then
8. $ntp.id ← C$

Inverted-Index-based Keyword DBSCAN
In order to perform DBSCAN on only the tuples containing the query keywords, we consider a method that quickly searches for such tuples by indexing the keyword values of the whole dataset. In this section, we introduce IN-DBSCAN, which combines an inverted index [15, 26] with DBSCAN.
As described above, an inverted index indexes keywords so that the tuples containing a specific keyword can be retrieved quickly from a large database. In order to execute DBSCAN through the inverted index, the process has the three steps shown in Fig. 3.
Step 1: Generate an inverted index.
Step 2: Search for the query keyword in the inverted index to find tuple IDs containing the query keyword. The R-tree is constructed using only those tuples.
Step 3: Perform DBSCAN using the R-tree.
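Steps 1 and 2 can be sketched as follows. This is a simplified dict-based index rather than the list-of-records layout of Algorithm 3; step 3 then runs plain DBSCAN over only the surviving tuples:

```python
from collections import defaultdict

def make_inverted_index(tuples):
    """Step 1: map each keyword to the IDs of the tuples that contain it."""
    index = defaultdict(list)
    for tp in tuples:
        for term in tp["W"]:
            index[term].append(tp["ID"])
    return index

def candidate_ids(index, query):
    """Step 2: IDs of tuples containing every query keyword (AND semantics)."""
    id_sets = [set(index.get(term, ())) for term in query]
    return set.intersection(*id_sets) if id_sets else set()
```

Because the R-tree in step 2 is built only from the candidate IDs, it must be rebuilt for every new set of query keywords, which is the main cost of this approach.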

Fig. 3. Inverted-index-based keyword DBSCAN with regard to the query keywords milk and sandwich.

Algorithm 3 shows the process to obtain the inverted index from the database.
Algorithm 3. MakeInvertedIndex
Input: A set of n tuples 𝒟
Output: Inverted index $𝒳$
1. Let $𝒳$ be a list $(I_1, I_2, …, I_k)$, where $I=(ID, term, objectIDs)$
2. $num ← 1$
3. for each $tp∊𝒟$ do
4. $objectID ← tp.ID$
5. for each $term ∊ tp.W$ do
6. if $𝒳$ has $term$ then
7. Add $objectID$ to $𝒳.objectIDs$ for $term$
8. else
9. Add $I=(num, term, objectID)$ in $𝒳$
10. $num ← num + 1$
11. return $𝒳$
The $MakeInvertedIndex$ function creates an inverted index from the database. Based on the keyword information of each tuple, the IDs of the tuples that have a keyword are recorded in the inverted index. The inverted index, $𝒳$, is composed of elements consisting of an ID, a term indicating a keyword, and $objectIDs$ storing the IDs of the tuples that have that keyword. There are two ways to add information to $𝒳$. (1) When the keyword of the tuple already exists in $𝒳$, only the ID of the tuple is added to the corresponding $objectIDs$ (line 7). (2) If the keyword does not exist in $𝒳$, a new element $I$ with a new ID, the keyword, and the tuple's objectID is added to the list (line 9).
The inverted index is used to retrieve the ID of the tuple contained in the query keyword, and the R-tree is constructed only by the retrieved tuples. IN-DBSCAN performs DBSCAN using the R-tree constructed through this process.

Fig. 4. IR2-tree for example dataset.

IR2-Tree based Keyword DBSCAN
In this section, we present the proposed KM-DBSCAN algorithm using an IR2-tree. In order to store data consisting of both spatial information and textual descriptions, KM-DBSCAN makes use of an IR2-tree [16]. We consider the value vector of a tuple as a point in d-dimensional space. For each tuple, the value vector is indexed by a minimum bounding rectangle (MBR), and its keywords are indexed by a signature. Fig. 4 shows the structure of the IR2-tree for the example dataset (Fig. 5).
In an IR2-tree, each node stores the MBR coordinate information and a signature representing its keywords as bit information. DBSCAN based on an IR2-tree visits only the nodes that satisfy the following condition, which is checked in addition to the conditions of the existing DBSCAN:

$checkSig(δ,τ) = \begin{cases} true & \text{if } δ=(δ∧τ)\cr false & \text{otherwise} \end{cases}$
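With signatures represented as integers, this check is a single mask test. A minimal sketch, where δ is the query signature and τ the node signature:

```python
def check_sig(delta: int, tau: int) -> bool:
    """True iff every bit set in the query signature delta is also set in tau."""
    return delta == (delta & tau)
```

If the test fails, no keyword of the query can be present anywhere in the subtree rooted at that node, so the subtree can be pruned.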

Fig. 5. 2D tuples and MBRs.

Algorithm 4. MakeSignature
Input: a set Q of query keywords
Output: Signature Sig
1. Let $Sig$ be a bit array $[0…L-1]$ of length $L$
2. Let $KeySig$ be a bit array $[0…L-1]$ of length $L$
3. Initialize $Sig[0…L-1] = 0$
4. for each $term∊Q$ do
5. $Value ← 0$
6. $p ← 31$
7. $m ← 10^9 + 9$
8. $pow ← 1$
9. for each $character∊term$ do
10. $Value ← (Value + character × pow) \bmod m$
11. $pow ← pow × p$
12. Initialize $KeySig[0…L-1] = 0$
13. for $i = 0$ to $bitNum$ do
14. $randomValue ← PRNG(Value)$
15. $bit ← randomValue \bmod L$
16. $KeySig[bit] ← 1$
17. $i ← i + 1$
18.   $Sig ← Sig | KeySig$
19. return $Sig$
The MakeSignature function creates a keyword signature of length L. A hash function maps each keyword to an integer value, and random numbers are then generated by seeding a generator with that integer value (lines 5–12). We use a polynomial hash function, which maps a string to an integer. The polynomial hash function for a string S of length N is defined as follows:

$(S[0] + S[1]·p + S[2]·p^2 + ⋯ + S[N-1]·p^{N-1}) \bmod m$

where $p$ and $m$ are positive primes. In our study, $p$ is set to 31, which is known to have a low probability of hash-value collisions, and $m$ is set to $10^9 + 9$, considering that the maximum size of an integer data type is 8 bytes (64 bits). For one keyword, the ASCII code value of each character constituting the keyword enters the S array. Given the keyword “car,” the ASCII code value of each character is entered in the S array, i.e., S[0] = 99 (‘c’), S[1] = 97 (‘a’), and S[2] = 114 (‘r’).
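The hash computation might look as follows in code (a sketch following the hash loop of Algorithm 4, with the same $p = 31$ and $m = 10^9 + 9$):

```python
P, M = 31, 10**9 + 9  # small prime base, large prime modulus

def poly_hash(term: str) -> int:
    """Polynomial hash: sum of S[i] * p^i over the characters' ASCII codes, mod m."""
    value, power = 0, 1
    for ch in term:
        value = (value + ord(ch) * power) % M
        power = (power * P) % M
    return value
```

For the keyword “car” this gives 99 + 97·31 + 114·31² = 112660 (below m, so the modulus does not change it).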
The keyword signature, $KeySig$, is set to 1 at the bit positions determined by the generated random numbers (lines 13–17). A pseudo-random number generator (PRNG) is an algorithm for generating random sequences using mathematical formulas; it generates a series of numbers with properties similar to those of true random numbers (line 14). The keyword signatures generated for each keyword are OR’ed together to generate the signature, $Sig$ (line 18).
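Putting the pieces together, a sketch of MakeSignature seeds Python's PRNG with each keyword's hash value; the signature length and bits-per-keyword below are illustrative choices, not the paper's defaults:

```python
import random

L_SIG = 64      # signature length in bits (illustrative; the paper tests 64-768)
BIT_NUM = 3     # bits set per keyword (assumed)
P, M = 31, 10**9 + 9

def poly_hash(term: str) -> int:
    """Polynomial hash of a keyword (lines 5-11 of Algorithm 4)."""
    value, power = 0, 1
    for ch in term:
        value = (value + ord(ch) * power) % M
        power = (power * P) % M
    return value

def make_signature(keywords) -> int:
    """Seed a PRNG with each keyword's hash, set BIT_NUM pseudo-random bits,
    then OR the keyword signatures into the final signature."""
    sig = 0
    for term in keywords:
        rng = random.Random(poly_hash(term))  # PRNG seeded with the hash value
        key_sig = 0
        for _ in range(BIT_NUM):
            key_sig |= 1 << rng.randrange(L_SIG)
        sig |= key_sig                        # superimpose (OR) keyword signatures
    return sig
```

Seeding the PRNG with the hash makes the signature deterministic: the same keyword always sets the same bits, which is what allows the containment test against node signatures to work.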
KM-DBSCAN operates in the same way as the general DBSCAN, but the difference is that a signature-pruning process is added to the range query. Algorithm 5 shows the search process using the signature of the IR2-tree.

Algorithm 5. IRangeQuery
Input: a tuple tp, $ɛ$, a set $Q$ of query keywords, IR2-tree t
Output: a neighborhood set $N$
1. Init $N$ // initialize empty set
2. Init $H$ // initialize empty heap
3. $H.add(t.root)$
4. $δ ← MakeSignature(Q)$
5. while $H$ is not empty do
6. $e ← H.remove()$ // remove a top entry
7. $minDis ←$ minimum distance between $tp$ and $e$
8. $τ ←$ signature of $e$
9.   if $minDis ≤ ɛ$ and $checkSig(δ,τ)$ then
10.     if $e$ is non-leaf then
11.       for each $child$ in $e$ do
12. $minDis ←$ minimum distance between $tp$ and $child$
13. $τ ←$ signature of $child$
14.         if $minDis ≤ ɛ$ and $checkSig(δ,τ)$ then
15. $H.add(child)$
16.     else // if e is a leaf
17.       for each $data$ in $e$ do
18. $minDis ←$ minimum distance between $tp$ and $data$
19. $τ ←$ signature of $data$
20.         if $minDis ≤ ɛ$ and $checkSig(δ,τ)$ then
21. $N.add(data)$
22. return $N$
$N$ and $H$ are initialized to empty values (lines 1–2). $N$ stores the keyword-matched ɛ-neighborhoods for the query keywords $Q$, $KN_ɛ(tp, Q)$, and $H$ stores the nodes to be searched in the IR2-tree. The search for keyword-matched $ɛ$-neighborhoods is started by inserting the root node into $H$ (line 3). The MakeSignature function returns the signature of the query keywords $Q$, which is stored in $δ$ (line 4). Algorithm 5 is executed until $H$ is completely empty (line 5). First, the node with the highest priority is removed from $H$ and returned as $e$ (line 6). The minimum distance between the query tuple and $e$ is stored in $minDis$, and the signature of the node $e$ is stored in $τ$ (lines 7–8). If $minDis$ is greater than $ɛ$ or the return value of $checkSig(δ, τ)$ is false, the search within the node $e$ stops, because $e$ cannot contain any keyword-matched ɛ-neighborhood of the query tuple $tp$ (line 9). Otherwise, the search for the keyword-matched $ɛ$-neighborhoods of the query tuple $tp$ continues (lines 10–21). If the node $e$ is a non-leaf, then for each child node of $e$, the minimum distance between the child and the query tuple $tp$ is calculated and stored in $minDis$, and the child's signature is stored in $τ$ (lines 12–13). If $minDis$ is less than or equal to ɛ and the return value of $checkSig(δ, τ)$ is true, the child node may contain keyword-matched $ɛ$-neighborhoods, so it is added to $H$ (lines 14–15). If the node $e$ is a leaf node, each tuple contained in $e$ is examined to determine whether it is a keyword-matched ɛ-neighborhood of the query tuple $tp$ (lines 16–17). For each tuple in the node $e$, the distance between the tuple and the query tuple $tp$ is calculated and stored in $minDis$, and the tuple's signature is stored in $τ$ (lines 18–19). If $minDis$ is less than or equal to ɛ and the return value of $checkSig(δ, τ)$ is true, the tuple is a keyword-matched $ɛ$-neighborhood; thus the tuple is added to $N$ (lines 20–21).
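A runnable sketch of this best-first traversal over a toy two-level tree follows; the Node layout, the integer signatures, and the MINDIST helper are our own simplifications of the IR2-tree, not the paper's data structures:

```python
import heapq
import itertools
import math

class Node:
    """Minimal IR2-tree node: an MBR plus an OR'ed signature; leaves hold
    (point, signature, id) entries, internal nodes hold child Nodes."""
    def __init__(self, mbr, sig, children=None, entries=None):
        self.mbr, self.sig = mbr, sig          # mbr = (xmin, ymin, xmax, ymax)
        self.children = children or []
        self.entries = entries or []

def min_dist(q, mbr):
    """MINDIST: minimum distance from point q to an axis-aligned rectangle."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def check_sig(delta, tau):
    return delta == (delta & tau)

def i_range_query(root, q, eps, q_sig):
    """IRangeQuery sketch: best-first traversal, pruning any node whose MBR is
    farther than eps from q or whose signature lacks a query bit."""
    out, counter = [], itertools.count()       # counter breaks heap ties
    heap = [(min_dist(q, root.mbr), next(counter), root)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > eps or not check_sig(q_sig, node.sig):
            continue                            # prune (line 9)
        if node.children:                       # non-leaf: enqueue qualifying children
            for c in node.children:
                dc = min_dist(q, c.mbr)
                if dc <= eps and check_sig(q_sig, c.sig):
                    heapq.heappush(heap, (dc, next(counter), c))
        else:                                   # leaf: test each entry
            for pt, sig, oid in node.entries:
                if math.dist(q, pt) <= eps and check_sig(q_sig, sig):
                    out.append(oid)
    return out
```

A query whose signature bit is absent from the root's signature is rejected without visiting a single child, which is exactly the search-space reduction the signature pruning provides.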

Cost evaluation
An R-tree is a representative index for storing spatial data. Since DBSCAN also processes spatial data, an R-tree is used to store the data. The search time complexity of an R-tree is $O(\log_m n)$, where $m$ is the maximum number of entries in one node and $n$ is the number of data objects. Therefore, the cost of an ɛ-neighborhood query is also $O(\log_m n)$. However, since an R-tree has no structure that can represent the set of keywords of an object, the ɛ-neighborhood query must descend to the leaf nodes to access the keywords owned by each object. An IR2-tree is based on an R-tree, with the difference that each node has a signature. Thus, the time complexity of an IR2-tree is also $O(\log_m n)$. In an IR2-tree, however, a leaf node contains a signature that summarizes the keyword sets of all objects contained in that node, and an internal node has a signature that incorporates the signatures of all its subnodes. Using the signatures of the internal nodes, the ɛ-neighborhood query can detect nodes that cannot contain the query keywords in the middle of the search process. In other words, the ɛ-neighborhood query does not always need to descend to the leaf nodes, which reduces the search space.

Experiments

In this section, we evaluate the performance of our three proposed DBSCAN algorithms (NK-DBSCAN, IN-DBSCAN, and KM-DBSCAN) by varying several parameters. We used real and synthetic datasets in the experiments. We use various cardinalities $N$, signature lengths $𝓁$, numbers of query keywords $𝓀$, radii of the neighborhood ɛ, and minimum numbers of neighbor tuples minPtr. Table 3 lists the system parameters we used in our experiments; default values are shown in boldface. In each experiment, we varied a single parameter while the other parameters were set to their default values.

Table 3. System parameters

Parameter Value
Cardinality of tuples ($N$) 100K, 200K, 400K, 600K, 1M
Signature length ($𝓁$) 64, 128, 192, 256, 320, 384, 448, 512, 640, 768
Number of query keywords ($𝓀$) 1, 2, 3, 4, 5
Radius of a neighborhood ($ɛ$) 10, 20, 30, 40, 50
Minimum number of neighbor tuples ($minPtr$) 3, 5, 10, 15, 20
Distribution (for spatial data vectors) Uniform, Gaussian
Bold text indicates the default value.

All the proposed algorithms were implemented in Java, and the experiments were conducted on a Linux (Ubuntu 14.04) platform using an Intel Core i7 at 3.50 GHz with 32 GB memory. The page size of a node in the IR2-tree was matched to the disk block size (i.e., 4096 bytes).

Datasets
We used real and synthetic datasets. For the real dataset, we collected sets of items from data.world (https://data.world/datasets/geospatial). Each tuple had longitude and latitude coordinates as spatial data. The collected data were slightly modified for our experimental purposes. Because the real dataset was highly skewed, we normalized the spatial data into the space (0, 0) × (5K, 5K). In order to evaluate the effects of various parameters, we generated two types of synthetic datasets: one has spatial data with a uniform distribution over the space, and the other has spatial data with a Gaussian distribution. Our datasets have various cardinalities N in the range [100K, 1M] and signature lengths 𝓁 in the range [64, 768]. The spatial data vector space is (0, 0) × (5K, 5K). Each tuple is described by eight keywords.

Performance measurement
In all the experiments, we focus on CPU time. For all the algorithms, we measure performance as elapsed time, i.e., the total execution time of the algorithm. However, only the elapsed time of IN-DBSCAN includes index construction time, because IN-DBSCAN has to construct a new R-tree whenever query keywords are given.

Effect of Scalability
In this section, we evaluate scalability with respect to the cardinality N of the dataset. We conduct experiments with cardinalities N in [100K, 1M], setting 𝓁 = 768 and 𝓀 = 2. Fig. 6 shows the results for varying N. As shown in Fig. 6, KM-DBSCAN generally outperforms NK-DBSCAN and IN-DBSCAN on all the datasets, and the performance gap widens as the cardinality increases. This shows that the KM-DBSCAN algorithm is effective for large datasets. Notably, the difference between KM-DBSCAN and the others is greater on the real dataset than on the synthetic datasets. This is because KM-DBSCAN reduces the number of accessed nodes using the signatures of the IR2-tree.

Fig. 6. Effect of varying cardinality N: (a) uniform distribution in synthetic dataset, (b) Gaussian distribution in synthetic dataset, and (c) real data.

Effect of the Number of Query Keywords
In order to study the effect of the number of query keywords, we vary the number of query keywords 𝓀 within the range [1, 5] with N = 1M and 𝓁 = 768. In this experiment, KM-DBSCAN generally outperforms NK-DBSCAN and IN-DBSCAN on both synthetic and real datasets. Fig. 7 shows that the difference between KM-DBSCAN and the others decreases as the number of query keywords increases, especially on the synthetic datasets. This is because the experiment retrieves tuples containing all of the given query keywords; as more query keywords are given, more entries match the query signature, so fewer nodes can be pruned early.

Effect of Signature Length
In this experiment, we study the effect of the signature length in the IR2-tree. The signature stored in an IR2-tree node is the bitwise OR of the signatures of all its child nodes. Therefore, the higher a node is in the tree, the greater the number of 1-bits in its signature. If a higher node's signature contains many 1-bits, the effect of pruning with the query-keyword signature is reduced. Longer signatures therefore improve pruning performance; however, because we fixed the node size to the disk block size (i.e., 4,096 bytes), increasing the signature length also increases the number of nodes that make up the IR2-tree, and hence the index size. It is thus important to find the optimal signature length for the given data or purpose.
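The superimposed-coding behavior described above can be illustrated as follows. This is a simplified sketch, not the paper's implementation; the hash-based bit assignment is an assumption for illustration only.

```python
import hashlib

# Simplified sketch of IR2-tree signatures (not the paper's implementation).
# Each keyword sets a few bits of an L-bit signature (superimposed coding);
# a parent node's signature is the bitwise OR of its children's signatures.
L = 64  # signature length in bits (illustrative)

def keyword_signature(keyword, length=L, bits_per_keyword=3):
    """Map a keyword to a signature by setting a few hashed bit positions."""
    sig = 0
    for i in range(bits_per_keyword):
        digest = hashlib.md5(f"{keyword}:{i}".encode()).hexdigest()
        sig |= 1 << (int(digest, 16) % length)
    return sig

def node_signature(child_signatures):
    """A node's signature is the OR of all its children's signatures."""
    sig = 0
    for child in child_signatures:
        sig |= child
    return sig

def may_contain(node_sig, query_sig):
    """Visit the node only if every query bit is set in the node signature.
    False positives are possible; false negatives are not."""
    return node_sig & query_sig == query_sig
```

Because higher nodes OR together more child signatures, more of their bits are set to 1, which is why pruning weakens near the root and why longer signatures (fewer bit collisions) prune more effectively.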
In order to measure the effect of the signature length, we vary 𝓁 within the range [64, 768] with N = 1M and 𝓀 = 2. Fig. 8(a) shows that the CPU time of KM-DBSCAN, IN-DBSCAN, and NK-DBSCAN decreases as the signature length increases. On the real data, when 512 ≤ 𝓁, there is little further decrease in CPU time. Fig. 8(b) shows that the number of nodes increases as the signature length increases; the uniform and Gaussian datasets have the same number of nodes regardless of their distribution.

Fig. 7. Effect of varying the number of query keywords: (a) uniform distribution in the synthetic dataset, (b) Gaussian distribution in the synthetic dataset, and (c) real data.

Fig. 8. Effect of varying the length of signature: (a) length of signature vs. processing cost and (b) length of the signature vs. the number of nodes in the IR2-tree.

Effect of minPts
In this section, we evaluate performance with respect to minPts. We conduct experiments with minPts in {2, 4, 8, 16, 32}, setting 𝓁 = 768 and 𝓀 = 2. Fig. 9 shows the results for varying minPts. As shown in Fig. 9, KM-DBSCAN generally outperforms NK-DBSCAN and IN-DBSCAN on all the datasets. NK-DBSCAN is less sensitive to minPts than the other algorithms: its performance is largely unaffected, because increasing minPts does not reduce the work of finding the ɛ-neighborhood of a tuple in NK-DBSCAN. The performance of KM-DBSCAN and IN-DBSCAN improves as minPts increases, because the number of cells containing a core tuple decreases as minPts increases, thus reducing the work of the merging step.

Fig. 9. Effect of varying minPts: (a) uniform distribution in the synthetic dataset, (b) Gaussian distribution in the synthetic dataset, and (c) real data.

Effect of ɛ
To confirm the effect of ɛ, we vary ɛ from 10 to 50; the other parameters are given their default values (Table 3). Fig. 10 shows the effect of the size of ɛ. For each algorithm, the runtime is approximately proportional to ɛ. KM-DBSCAN outperforms IN-DBSCAN and NK-DBSCAN, and even as ɛ increases, the gap between the execution times of KM-DBSCAN and IN-DBSCAN remains almost constant. The execution time of NK-DBSCAN increases significantly with ɛ, because the number of MBRs overlapping the ɛ-neighborhood grows with ɛ, and searching many MBRs in the R-tree greatly increases the execution time.

Fig. 10. Effect of varying epsilon: (a) uniform distribution in the synthetic dataset, (b) Gaussian distribution in the synthetic dataset, and (c) real data.

Conclusion

This paper defined the keyword-matched DBSCAN problem and proposed three algorithms to solve it. We first proposed NK-DBSCAN, which applies simple pruning on an R-tree. We then described IN-DBSCAN, which uses an inverted index on the keyword data only. Finally, we presented KM-DBSCAN, which operates on an IR2-tree that maintains the keyword information of tuples as signatures. KM-DBSCAN retrieves keyword-matched tuples using a nearest-neighbor search technique on the IR2-tree; while traversing the tree, it effectively prunes nodes that cannot contain the query keywords by means of the signatures stored in IR2-tree nodes. In order to evaluate the performance of our algorithms, we conducted extensive experiments in various settings. The experimental results showed that KM-DBSCAN is effective in improving computational performance. In future work, we plan to extend our proposal to handle not only uniform and Gaussian distributions but also skewed data distributions, and to consider an optimized keyword-matched DBSCAN algorithm in a parallel processing environment.

Acknowledgements

Not applicable.

Author’s Contributions

Conceptualization, BK, JJ. Funding acquisition, BK. Investigation and methodology, BK. Project administration, BK. Resources, JJ. Supervision, JJ. Writing of the original draft, BK. Writing of the review and editing, BK, JJ. Software, JJ. Validation, BK. Data Curation, BK. Visualization, BK, JJ.

Funding

This work was supported by the Dongguk University Research Fund of 2019 and by the Research Grant of Jeonju University in 2021. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (No. 2020R1F1A1077369) and by the Korea government (MSIT) (No. 2021R1F1A1049387).

Competing Interests

The authors declare that they have no competing interests.

Author Biography

Name : Hong-Jun Jang
Affiliation : Jeonju University
Biography : He received the B.Sc. and Ph.D. degrees in computer science education from Korea University, Seoul, South Korea. His research interests include data mining, machine learning, and databases.

Name : Byoungwook Kim
Affiliation : Dongguk University Gyeongju Campus
Biography : He received the B.Sc. and Ph.D. degrees in computer science education from Korea University, Seoul, South Korea. Since 2018, he has been with the Department of Computer Engineering, Dongguk University, Gyeongju, South Korea. He had been with the Creative Informatics and Computing Institute, Korea University, Seoul. He is currently a Research Fellow with the New Media Education and Development Center, Korea National Open University. His research interests include data mining, machine learning, intelligent tutoring systems, and e-learning systems.

References

[1] Q. Liu, M. Deng, Y. Shi, and J. Wang, “A density-based spatial clustering algorithm considering both spatial proximity and attribute similarity,” Computers & Geosciences, vol. 46, pp. 296-309, 2012.
[2] J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169-194, 1998.
[3] V. Estivill-Castro and I. Lee, “Multi-level clustering and its visualization for exploratory spatial analysis,” GeoInformatica, vol. 6, no. 2, pp. 123-152, 2002.
[4] Z. Li, Algorithmic Foundation of Multi-Scale Spatial Representation. Boca Raton, FL: CRC Press, 2006.
[5] H. J. Miller and J. Han, Geographic Data Mining and Knowledge Discovery. Boca Raton, FL: CRC Press, 2009.
[6] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996, pp. 226-231.
[7] H. J. Jang, K. S. Hyun, J. Chung, and S. Y. Jung, “Nearest base-neighbor search on spatial datasets,” Knowledge and Information Systems, vol. 62, no. 3, pp. 867-897, 2020.
[8] D. W. Choi and C. W. Chung, “Nearest neighborhood search in spatial databases,” in Proceedings of 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, 2015, pp. 699-710.
[9] D. Zhang, C. Y. Chan, and K. L. Tan, “Nearest group queries,” in Proceedings of the 25th International Conference on Scientific and Statistical Database Management, Baltimore, MD, 2013, pp. 1-12.
[10] H. J. Jang, B. Kim, J. Kim, and S. Y. Jung, “An efficient grid-based k-prototypes algorithm for sustainable decision-making on spatial objects,” Sustainability, vol. 10, no. 8, article no. 2614, 2018. https://doi.org/10.3390/su10082614
[11] H. P. Kriegel, P. Kroger, J. Sander, and A. Zimek, “Density‐based clustering,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 231-240, 2011.
[12] C. Dharni and M. Bansal, “An improvement of DBSCAN algorithm to analyze cluster for large datasets,” in Proceedings of 2013 IEEE International Conference in MOOC, Innovation and Technology in Education (MITE), Jaipur, India, 2013, pp. 42-46.
[13] D. Wu, J. Shi, and N. Mamoulis, “Density-based place clustering using geo-social network data,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 5, pp. 838-851, 2017.
[14] T. O'Reilly, “What is Web 2.0: design patterns and business models for the next generation of software,” 2007 [Online]. Available: https://mpra.ub.uni-muenchen.de/4578/1/MPRA_paper_4578.pdf.
[15] J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files for text indexing,” ACM Transactions on Database Systems (TODS), vol. 23, no. 4, pp. 453-490, 1998.
[16] I. De Felipe, V. Hristidis, and N. Rishe, “Keyword search on spatial databases,” in Proceedings of 2008 IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 2008, pp. 656-665.
[17] N. Roussopoulos, S. Kelley, and F. Vincent, “Nearest neighbor queries,” in Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, 1995, pp. 71-79.
[18] G. R. Hjaltason and H. Samet, “Distance browsing in spatial databases,” ACM Transactions on Database Systems (TODS), vol. 24, no. 2, pp. 265-318, 1999.
[19] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung, “Density-based place clustering in geo-social networks,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, 2014, pp. 99-110.
[20] J. Chen, H. He, J. Chen, S. Yu, and Z. Shi, “Fast density clustering algorithm for numerical data and categorical data,” Mathematical Problems in Engineering, vol. 2017, article no. 6393652, 2017. https://doi.org/10.1155/2017/6393652
[21] H. Song and J. G. Lee, “RP-DBSCAN: a superfast parallel DBSCAN algorithm based on random partitioning,” in Proceedings of the 2018 International Conference on Management of Data, Houston, TX, 2018, pp. 1173-1187.
[22] S. Hayat, A. Rextin, A. Idris, and M. Nasim, “Text and phone calls: user behaviour and dual-channel communication prediction,” Human-centric Computing and Information Sciences, vol. 10, article no. 11, 2020. https://doi.org/10.1186/s13673-020-00217-x
[23] M. Zhang, R. Fu, Y. Guo, L. Wang, P. Wang, and H. Deng, “Cyclist detection and tracking based on multi-layer laser scanner,” Human-centric Computing and Information Sciences, vol. 10, article no. 20, 2020. https://doi.org/10.1186/s13673-020-00225-x
[24] Y. Liao, “Hot spot analysis of tourist attractions based on stay point spatial clustering,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 750-759, 2020.
[25] G. Cong, C. S. Jensen, and D. Wu, “Efficient retrieval of the top-k most relevant spatial web objects,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 337-348, 2009.
[26] B. Yao, F. Li, M. Hadjieleftheriou, and K. Hou, “Approximate string search in spatial databases,” in Proceedings of 2010 IEEE 26th International Conference on Data Engineering (ICDE), Long Beach, CA, 2010, pp. 545-556.
[27] J. B. Rocha-Junior, O. Gkorgkas, S. Jonassen, and K. Norvag, “Efficient processing of top-k spatial keyword queries,” in Advances in Spatial and Temporal Databases. Heidelberg, Germany: Springer, 2011, pp. 205-222.
[28] H. Choi, H. Jung, K. Y. Lee, and Y. D. Chung, “Skyline queries on keyword-matched data,” Information Sciences, vol. 232, pp. 449-463, 2013.
[29] H. H. Bu, N. C. Kim, B. J. Yun, and S. H. Kim, “Content-based image retrieval using multi-resolution multi-direction filtering-based CLBP texture features and color autocorrelogram features,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 991-1000, 2020.
[30] W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms. Upper Saddle River, NJ: Prentice Hall, 1992.
