
Hong-Jun Jang^{1} and Byoungwook Kim^{2,*}

Human-centric Computing and Information Sciences volume **11**, Article number: 43 (2021)


https://doi.org/10.22967/HCIS.2021.11.043

Abstract

Density-based spatial clustering is one of the most famous algorithms for spatial clustering. Most clustering studies regarding the density-based spatial clustering of applications with noise (DBSCAN) have focused on efficiently finding groups for spatial objects with numeric data or categorical data. In this paper, we propose a novel density-based spatial clustering algorithm called keyword-matched DBSCAN (KM-DBSCAN), which deals with textual data as well as numerical or categorical data. KM-DBSCAN utilizes an IR^{2}-tree as an index structure. To find keyword-matched objects in large datasets, it performs a nearest neighbor search using a branch-and-bound method. While traversing the IR^{2}-tree, KM-DBSCAN effectively prunes unqualified nodes by utilizing both spatial and textual information in nodes. To demonstrate the efficiency of KM-DBSCAN, we conducted extensive experiments in various settings. The experimental results show that KM-DBSCAN is very efficient in terms of computational cost.

Keywords

Spatial Clustering, DBSCAN, Textual Information, IR^{2}-Tree

Introduction

Spatial clustering, which groups a set of spatial objects into groups called clusters, is one of the key technologies for spatial data mining and spatial data analysis [1]. Spatial clustering ensures that spatial objects in the same cluster are similar to each other and dissimilar from those in other clusters. Recently, with the development of the geographic information system, there is increasing interest in, and research on, the development of new application services such as location-based services and telematics providing a route search, surrounding information searches, emergency services, etc. Spatial clustering is also widely used in the field of spatial data analysis such as land-use detection [2], crime hot-spot analysis [3], cartographic generalization [4], geographic customer segmentation [5], and earthquake analysis [6]. As the technology of finding useful knowledge in spatial data becomes more important, a variety of research on finding groups has been conducted, such as nearest base-neighbor searches on spatial datasets [7], nearest neighborhood searches in spatial databases [8], nearest group queries [9], and the grid-based K-prototypes algorithm [10].

Density-based spatial clustering of applications with noise (DBSCAN) is one of the most famous algorithms for spatial clustering [11–13]. DBSCAN requires two input parameters, ɛ and minPtr. The neighborhood within radius ɛ of a given object is called the ɛ-neighborhood of the object, and an object with at least minPtr objects within its ɛ-neighborhood is called a core object. DBSCAN can find clusters of variable sizes and shapes, and it also detects noise. Most clustering studies on DBSCAN have focused on efficiently finding groups for spatial objects with numeric or categorical data. However, many real-world spatial objects have textual information, not just numeric or categorical data. Hence, if the textual information affects spatial clustering results, the error can increase when the final clustering results are evaluated. Table 1 shows an example of the data we are going to deal with. Each tuple has location information (latitude, longitude) for a shop, along with textual information about the shop. The textual information in the spatial data can be drawn from a variety of text descriptions obtained from newspaper articles, personal blogs, and social networking services. For simplicity, we assume that a textual description is a set of keywords such as tags [14].

**Table 1.** An example of spatial data with keywords

ID | Type | Latitude | Longitude | Textual information |
---|---|---|---|---|

1 | bakery | 14 | 6 | coffee, sandwich, milk |

2 | restaurant | 12 | 3 | steak, pasta |

3 | bakery | 30 | 4 | cookie, ice cream |

4 | book store | 31 | 9 | book, snack |

5 | bakery | 59 | 9 | coffee, ice cream |

6 | bakery | 12 | 8 | coffee, sandwich, milk |

7 | restaurant | 12 | 3 | sandwich, steak, pasta |

8 | clothing shop | 30 | 4 | suits, necktie |

9 | book store | 31 | 9 | book, milk |

10 | bakery | 59 | 9 | cookie, ice cream |

The data in Table 1 consists of the type of shop, the spatial data in which the shop is located, and the menu item data for each shop. In this case, we use as keywords the items that each shop sells. This data can be used in market area analysis, e.g., selecting the most effective advertising locations or the site of a new shop. For example, suppose a user wants to open a sandwich shop in an area where there are many other shops. To avoid overheated competition, it is reasonable for the user to choose an area with fewer sandwich shops. Moreover, not only should the area have no other sandwich shops, but there should also be many shops in the area that are often frequented by potential customers. Therefore, the user prefers large, high-density areas (for high sales) with no other sandwich shops. Fig. 1 shows an example of density-based clustering on spatial data using keywords. Considering the location data only, the areas where shops are concentrated are a1, a2, a3, and a4. Using the items shops have for sale (provided as text information), the areas where shops are crowded and where no shops sell sandwiches are a2 and a4. As such, if a concentrated area is searched for using keywords, a detailed area suitable for the user's intention can be found.
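As a minimal illustration of the definitions above (the function names and sample points are ours, not the paper's), the ɛ-neighborhood and core-object tests behind DBSCAN can be sketched as:

```python
from math import dist

def epsilon_neighborhood(points, i, eps):
    """Indices of all points within radius eps of points[i] (including itself)."""
    return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

def is_core(points, i, eps, min_pts):
    """A point is a core object if its eps-neighborhood holds at least min_pts points."""
    return len(epsilon_neighborhood(points, i, eps)) >= min_pts

points = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(is_core(points, 0, eps=1.5, min_pts=3))  # True: (0,0) has 3 neighbors within 1.5
print(is_core(points, 3, eps=1.5, min_pts=3))  # False: (10,10) is isolated
```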

Intuitively, this problem can be solved by combining a keyword search with the original DBSCAN algorithm; we refer to this combination as the naïve keyword-based DBSCAN (NK-DBSCAN) algorithm. This approach executes the original DBSCAN algorithm on the spatial data and, within the RangeQuery of the DBSCAN algorithm, keeps only the objects that contain the query keywords.

As a straightforward approach, we introduce an inverted-index-based keyword DBSCAN (IN-DBSCAN), which combines an inverted index [15] and the original DBSCAN algorithm. An inverted index is an index structure that is efficient when searching for a specific word in a large database. It indexes keywords so that the tuples containing a specific keyword can be retrieved quickly.
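A minimal sketch of such an inverted index, assuming tuples are (ID, keyword-set) pairs as in Table 1 (the function names are ours):

```python
from collections import defaultdict

def build_inverted_index(tuples):
    """Map each keyword to the set of IDs of the tuples that contain it."""
    index = defaultdict(set)
    for tid, keywords in tuples:
        for kw in keywords:
            index[kw].add(tid)
    return index

def lookup(index, query_keywords):
    """IDs of tuples containing every query keyword (intersection of inverted lists)."""
    result = None
    for kw in query_keywords:
        ids = index.get(kw, set())
        result = ids if result is None else result & ids
    return result or set()

# Tuples 1, 2, and 6 from Table 1
shops = [(1, {"coffee", "sandwich", "milk"}), (2, {"steak", "pasta"}),
         (6, {"coffee", "sandwich", "milk"})]
idx = build_inverted_index(shops)
print(sorted(lookup(idx, {"coffee", "sandwich"})))  # [1, 6]
```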

In order to deal with textual information in DBSCAN effectively, we propose an efficient algorithm referred to as the keyword-matched DBSCAN (KM-DBSCAN) algorithm. The proposed algorithm is based on an IR^{2}-tree, an R-tree whose nodes additionally store keyword signatures, so that both spatial and textual information can be used for pruning.

To demonstrate performance and scalability, we conducted extensive experiments on real and synthetic datasets. The experimental results show that KM-DBSCAN is efficient in terms of CPU time costs.

The contributions of this paper are summarized as follows.

We introduced novel density-based clustering algorithms that deal with textual information as well as numerical or categorical data.

We developed three DBSCAN algorithms to deal with textual information: NK-DBSCAN, IN-DBSCAN, and KM-DBSCAN algorithms based on an IR^{2}-tree.

We conducted extensive experiments on both real and synthetic datasets to show the effectiveness of KM-DBSCAN compared to NK-DBSCAN and IN-DBSCAN.

Related Work

Density-based Clustering Algorithms

In this section, we give a brief overview of density-based clustering. DBSCAN is a density-based clustering algorithm [6] that can identify clusters of arbitrary shape in a dataset that includes noise and outliers. The basic approach of density-based clustering is derived from an intuitive way for humans to recognize clusters: clusters are areas of high point density in the data space, separated by areas of low point density. The DBSCAN algorithm is based on these intuitive notions of clusters and noise. The key idea is that, for each point in a cluster, at least a minimum number of points should lie within a given radius.

Shi et al. [19] designed density-based clustering of places in geo-social networks to detect geo-social clusters. Their algorithm extends DBSCAN by replacing the Euclidean distance threshold for the extent of dense regions with a threshold ɛ that considers both the spatial and the social distances between places. Chen et al. [20] proposed a novel fast density clustering algorithm (FDCA) for mixed data which determines cluster centers automatically. Wu et al. [13] proposed extending traditional density-based clustering of spatial locations to consider their relationship to a social network of people who visit them and the times when the locations were visited. Song and Lee [21] developed a novel parallel DBSCAN algorithm, random partitioning-DBSCAN (RP-DBSCAN), which uses pseudo-random partitioning together with a two-level cell dictionary. RP-DBSCAN simultaneously finds the local clusters for each data partition and then merges these local clusters to obtain the global clustering. Up to now, most research on density-based clustering has focused on numeric or categorical data rather than textual data.

Zhang et al. [23] proposed a powerful cyclist detection and tracking method based on a multi-layer laser scanner that obtains a four-layer point cloud of the local environment; DBSCAN was utilized to segment the laser points into individual clusters for subregions. Hayat et al. [22] proposed a user prediction model for dual channels (text and call), in which DBSCAN is used to form clusters of communication events. Liao [24] proposed a hotspot analysis method based on spatial clustering of trajectory stops; the study utilized the DBSCAN algorithm, which can cluster arbitrary shapes in space with fast clustering speed and noise handling.

Keyword-Matching

Typical text-based data retrieval methods include inverted indexes and signature files. Inverted indexes have been widely used as an index structure for text search. An inverted index consists of several inverted lists; the inverted list for a word contains the IDs of all the documents that contain the word, together with the word's frequency in each. When a query is given, the inverted lists for each query word are retrieved, and the score of each document is calculated based on the frequency of the query words in it.

The signature file is an inexact filter method that speeds up retrieval by first removing dissimilar documents. It generates a keyword signature for each keyword using a hashing function and combines them into a document signature through superimposed coding. In the search phase, the search term is likewise converted into a signature, and candidate documents are found by AND-ing the search-term signature with each document signature. When the search term does not actually exist in a candidate document, the result is called a false drop. The longer the signature, the lower the false-drop rate.
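The superimposed-coding scheme and its filter property can be sketched as follows; the number of bits per keyword and the use of Python's built-in `hash` are our assumptions, not the paper's exact hashing:

```python
def keyword_signature(keyword, length=64, bits=3):
    """Hash a keyword to a fixed-length bitmask with a few bits set."""
    sig = 0
    for k in range(bits):
        sig |= 1 << (hash((keyword, k)) % length)
    return sig

def record_signature(keywords, length=64):
    """Superimposed coding: OR the keyword signatures into one record signature."""
    sig = 0
    for kw in keywords:
        sig |= keyword_signature(kw, length)
    return sig

def may_contain(record_sig, query_sig):
    """Inexact filter: True means 'possibly contains' (false drops are possible),
    False means 'definitely does not contain'."""
    return query_sig & record_sig == query_sig

doc = record_signature({"coffee", "milk"})
q = record_signature({"coffee"})
print(may_contain(doc, q))  # True: a record containing the keyword always passes
```

Note the asymmetry: the filter can never miss a true match, but a non-matching record may still pass (a false drop) when unrelated keywords happen to set the same bits.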

Recently, as a way of searching for keywords in space, research on mixed index structures combining a text-based retrieval method with a spatial indexing method has been actively conducted. De Felipe et al. [16] proposed an index structure that integrates a signature file and an R-tree to enable keyword retrieval for spatial data objects with a limited number of keywords. Cong et al. [25] proposed a hybrid index structure integrating an R-tree and an inverted index. Yao et al. [26] proposed a modified version of the R-tree called the MHR-tree, which uses min-wise signatures and a linear hashing technique. Rocha-Junior et al. [27] proposed a spatial inverted index that is a mixture of an R-tree and an inverted index. The IR^{2}-tree of De Felipe et al. [16], which embeds keyword signatures into R-tree nodes, is the index structure adopted in this paper.

In addition to general text data, text information included in an image may be utilized. Bu et al. [29] proposed a content-based image search system using a combination of the completed local binary pattern (CLBP) and color autocorrelogram.

Problem Definition

In this section, we define the keyword-matched DBSCAN. Let 𝒟 be a dataset consisting of tuples. A tuple consists of a spatial data vector V and a set of keywords W. Table 2 summarizes the notation used throughout this paper.

**Table 2.** Notations

Notation | Description |
---|---|

𝒟 | the entire dataset |

V | spatial data vector in a tuple |

W | a set of keywords |

t | cardinality of a set of keywords |

Q | a set of query keywords |

ɛ | the radius of a neighborhood |

minPtr | the minimum number of neighbor tuples |

The distance between two tuples is defined as the Euclidean distance between their spatial data vectors:

$dist(tp_i,tp_j)=\sqrt{\displaystyle\sum_{k=1}^{r}(tp_i.v_k-tp_j.v_k)^2},\qquad(1)$

The keyword-matched ɛ-neighborhood of a tuple, $KN_ɛ(tp, Q)$, adds a keyword-matching condition to the ɛ-neighborhood of a tuple: it contains the tuples within distance ɛ of $tp$ whose keyword sets match the query keywords $Q$. Tuple $tp$ is a core tuple if $|KN_ɛ(tp, Q)|$ is at least the minimum number of neighbor tuples, $minPtr$. Tuple $tp$ is a border tuple if $KN_ɛ(tp, Q)$ contains at least one core tuple and $|KN_ɛ(tp, Q)|<minPtr$.
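Under these definitions, and assuming "matched" means the tuple's keyword set contains every query keyword, $KN_ɛ(tp, Q)$ can be sketched as (sample tuples adapted from Table 1; function names are ours):

```python
from math import dist

def keyword_matched_neighborhood(data, i, eps, query):
    """KN_eps(tp, Q): tuples within eps of data[i] (Euclidean distance, Eq. (1))
    whose keyword set contains every query keyword."""
    v_i, _ = data[i]
    return [j for j, (v, kws) in enumerate(data)
            if dist(v_i, v) <= eps and query <= kws]

def is_core_tuple(data, i, eps, min_ptr, query):
    """Core tuple: at least min_ptr keyword-matched tuples in the neighborhood."""
    return len(keyword_matched_neighborhood(data, i, eps, query)) >= min_ptr

data = [((14, 6), {"coffee", "sandwich", "milk"}),
        ((12, 8), {"coffee", "sandwich", "milk"}),
        ((12, 3), {"steak", "pasta"})]
print(keyword_matched_neighborhood(data, 0, eps=5, query={"coffee"}))  # [0, 1]
```

Tuple 2 lies within ɛ of tuple 0 but is excluded because its keywords do not match the query.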

Proposed Algorithms

As a straightforward approach, we present the NK-DBSCAN, which extends the original DBSCAN algorithm to deal with keywords. Its key idea is to consider only keyword-matched tuples in the DBSCAN algorithm. Algorithm 1 presents the NK-DBSCAN.

**Algorithm 1.** $NK\text{-}DBSCAN(𝒟, ɛ, minPts, Q)$

1. $C ← 1$

2. **for** each tuple $tp$ in $𝒟$ **do**

3. **if** $tp$ is unclassified and $Q ⊆ tp.W$ **then**

4. $KN_ɛ(tp) ← RangeQuery(tp, ɛ, Q)$

5. **if** $|KN_ɛ(tp)| ≥ minPts$ **then**

6. $tp.id$ ← $C$

7. $ExpandCluster(tp, KN_ɛ(tp), C, ɛ, minPts, Q)$

8. $C ← C$ + 1

9. **else**

10. label $tp$ as noise

11. **else if** $tp$ is unclassified **then**

12. label $tp$ as noise

13. **end for**

Algorithm 2 presents the $ExpandCluster$ function, which is similar to the $ExpandCluster$ function in the original DBSCAN except for $RangeQuery$. When the $ExpandCluster$ function is first called, $KN_ɛ(tp)$ contains only the keyword-matched ɛ-neighborhood of the seed tuple. The $RangeQuery$ function is executed for each tuple contained in $KN_ɛ(tp)$ and returns a new keyword-matched ɛ-neighborhood, $KN_ɛ(ntp)$ (line 4). As $KN_ɛ(ntp)$ is added to $KN_ɛ(tp)$, $KN_ɛ(tp)$ is extended, and the cluster is extended as well (line 6). When there is no new tuple in $KN_ɛ(tp)$, the expansion of the cluster terminates.

**Algorithm 2.** $ExpandCluster(tp, KN_ɛ(tp), C, ɛ, minPts, Q)$

1. **for** each tuple $ntp$ in $KN_ɛ(tp)$ **do**

2. **if** $ntp$ is unvisited **then**

3. mark $ntp$ as visited

4. $KN_ɛ(ntp) ← RangeQuery(ntp, ɛ, Q)$

5. **if** $|KN_ɛ(ntp)| ≥ minPts$ **then**

6. $KN_ɛ(tp) ← KN_ɛ(tp) ∪ KN_ɛ(ntp)$

7. **if** $ntp$ does not yet belong to any cluster **then**

8. $ntp.id ← C$
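Putting the keyword-filtered range query and cluster expansion together, a compact self-contained sketch of NK-DBSCAN (label constants and iterative expansion are our implementation choices, not the paper's) might look like:

```python
from math import dist

NOISE, UNCLASSIFIED = -1, 0

def range_query(data, i, eps, query):
    """Keyword-matched eps-neighborhood: within eps AND containing all query keywords."""
    v, _ = data[i]
    return [j for j, (u, kws) in enumerate(data)
            if dist(v, u) <= eps and query <= kws]

def nk_dbscan(data, eps, min_pts, query):
    """Sketch of NK-DBSCAN: plain DBSCAN whose range query also checks keywords."""
    labels = [UNCLASSIFIED] * len(data)
    cluster = 0
    for i in range(len(data)):
        if labels[i] != UNCLASSIFIED or not query <= data[i][1]:
            if labels[i] == UNCLASSIFIED:
                labels[i] = NOISE        # non-matching tuples can never join a cluster
            continue
        seeds = range_query(data, i, eps, query)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later be relabeled as a border tuple
            continue
        cluster += 1
        for j in seeds:
            labels[j] = cluster
        queue = list(seeds)              # iterative ExpandCluster
        while queue:
            j = queue.pop()
            neighbors = range_query(data, j, eps, query)
            if len(neighbors) >= min_pts:
                for k in neighbors:
                    if labels[k] in (UNCLASSIFIED, NOISE):
                        if labels[k] == UNCLASSIFIED:
                            queue.append(k)
                        labels[k] = cluster
    return labels

data = [((0, 0), {"coffee"}), ((1, 0), {"coffee"}), ((2, 0), {"coffee"}),
        ((50, 50), {"steak"})]
print(nk_dbscan(data, eps=1.5, min_pts=2, query={"coffee"}))  # [1, 1, 1, -1]
```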

Inverted-Index-based Keyword DBSCAN

In order to perform DBSCAN only on tuples containing the query keywords, we consider a method that quickly searches for such tuples by indexing the keyword values of the whole dataset. In this section, we introduce IN-DBSCAN, which combines an inverted index [15, 26] and DBSCAN. In order to execute DBSCAN through the inverted index, the process has the three steps shown in Fig. 3.

Step 1: Generate an inverted index.

Step 2: Search for the query keyword in the inverted index to find tuple IDs containing the query keyword. The R-tree is constructed using only those tuples.

Step 3: Perform DBSCAN using the R-tree.
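Steps 1 and 2 above can be sketched as follows (step 3 would bulk-load the surviving tuples into an R-tree and run DBSCAN; the function names are ours):

```python
from collections import defaultdict

def build_inverted_index(data):
    """Step 1: map each keyword to the set of tuple IDs containing it."""
    index = defaultdict(set)
    for tid, (_, keywords) in enumerate(data):
        for kw in keywords:
            index[kw].add(tid)
    return index

def in_dbscan_filter(data, query):
    """Step 2: keep only tuples whose IDs appear in every query keyword's
    inverted list; a real implementation would then build the R-tree on them."""
    index = build_inverted_index(data)
    matched = set.intersection(*(index.get(kw, set()) for kw in query))
    return sorted(matched)

data = [((14, 6), {"coffee", "milk"}), ((12, 3), {"steak"}),
        ((12, 8), {"coffee", "milk"})]
print(in_dbscan_filter(data, {"coffee"}))  # [0, 2]
```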

Algorithm 3 shows the process to obtain the inverted index from the database.

**Algorithm 3.** $BuildInvertedIndex(𝒟)$

1. $𝒳 ← ∅$

2. $num ← 1$

3. **for** each tuple $tp$ in $𝒟$ **do**

4. $objectID← tp.ID$

5. **for** each $term$ in $tp.W$ **do**

6. **if** $term$ already exists in $𝒳$ **then**

7. Add $objectID$ to $𝒳.objectIDs$ for $term$

8. **else**

9. Add $I=(num, term, objectID)$ in $𝒳$

10. $num← num+ 1$

11. **return** $𝒳$

The inverted index is used to retrieve the ID of the tuple contained in the query keyword, and the R-tree is constructed only by the retrieved tuples. IN-DBSCAN performs DBSCAN using the R-tree constructed through this process.

IR^{2}-Tree based Keyword DBSCAN

In this section, we present the proposed KM-DBSCAN algorithm using an IR^{2}-tree. In an IR^{2}-tree, each node stores a signature that summarizes, as a fixed-length bitmask, the keywords of the objects in its subtree. A query signature δ matches a node signature τ only when every bit set in δ is also set in τ:

$checkSig(δ,τ)=\begin{cases} true &if\ δ=(δ∧τ)\cr false &otherwise \end{cases} $
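Because signatures are bitmasks, $checkSig$ reduces to one AND and one comparison:

```python
def check_sig(delta, tau):
    """checkSig(δ, τ): δ == δ & τ iff every bit set in the query signature δ
    is also set in the node signature τ."""
    return delta == (delta & tau)

print(check_sig(0b0101, 0b1101))  # True: both query bits are present in the node
print(check_sig(0b0101, 0b1001))  # False: one query bit is missing, so the
                                  # subtree cannot contain the keyword and is pruned
```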

**Algorithm 4.** $MakeSignature(W)$

1. $Sig ← 0$

2. **for** each $keyword$ in $W$ **do**

3. $KeySig ← 0$

4. $i ← 0$

5. $Value← 0$

6. $p ← 31$

7. $m ← 10^9 + 9$

8. $pow ← 1$

9. **for** each $character$ in $keyword$ **do**

10. $Value ← (Value + character × pow) \bmod m$

11. $pow← pow × p$

12. **end for**

13. **while** $i$ is less than the number of bits per keyword **do**

14. $randomValue← PRNG(Value)$

15. $bit←randomValue \bmod 𝓁$

16. $KeySig[bit] ← 1$

17. $i← i + 1$

18. $Sig← Sig\ |\ KeySig$

19. **end for**

$(S[0] + S[1]·p+S[2]·p^2+⋯+S[N-1]·p^{N-1})\bmod m$

where $p$ and $m$ are positive primes. In our study, $p$ is set to 31, which is known to give a low probability of hash-value collisions, and $m$ is set to $10^9+9$. The keyword signature, $KeySig$, has its bits set to 1 at positions derived from the generated random numbers (lines 13–17). A pseudo-random number generator (PRNG) is an algorithm that uses a mathematical formula to generate a sequence of numbers with properties similar to those of random numbers (line 14). The keyword signatures generated for each keyword are OR'ed together to produce the node signature, $Sig$ (line 18).
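A sketch of this signature construction, seeding Python's PRNG with the polynomial hash; the exact PRNG and the number of bits set per keyword are our assumptions:

```python
import random

P, M = 31, 10**9 + 9  # primes used in Algorithm 4

def poly_hash(keyword):
    """Polynomial rolling hash S[0] + S[1]*p + ... + S[N-1]*p^(N-1) mod m."""
    value, power = 0, 1
    for ch in keyword:
        value = (value + ord(ch) * power) % M
        power = (power * P) % M
    return value

def make_signature(keywords, length=64, bits=3):
    """Seed a PRNG with each keyword's hash, set a few bits per keyword,
    and OR everything into one node signature."""
    sig = 0
    for kw in keywords:
        rng = random.Random(poly_hash(kw))   # deterministic per keyword
        for _ in range(bits):
            sig |= 1 << rng.randrange(length)
    return sig

s = make_signature({"coffee", "milk"})
q = make_signature({"coffee"})
print(q == q & s)  # True: the query signature is contained in the node signature
```

Because the per-keyword bit positions are deterministic, a node's signature always contains the signature of any keyword stored beneath it, which is exactly the property $checkSig$ relies on.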

KM-DBSCAN operates in the same way as the general DBSCAN; the difference is that a signature-pruning step is added to the range query. Algorithm 5 shows the search process using the signatures of the IR^{2}-tree.

**Algorithm 5.** $RangeQuery(tp, ɛ, Q)$ on an IR^{2}-tree $t$

1. $N ← ∅$

2. $H ←$ an empty priority queue

3. $H.add(t.root)$

4. $δ ← MakeSignature(Q)$

5. **while** $H$ is not empty **do**

6. $e←H.remove()$ // remove a top entry

7. $minDis$← minimum distance between $tp$ and $e$

8. $τ$← signature of $e$

9. **if** $minDis ≤ ɛ$ and $checkSig(δ, τ)$ **then**

10. **if** $e$ is a non-leaf node **then**

11. **for** each $child$ of $e$ **do**

12. $minDis$← minimum distance between $tp$ and $child$

13. $τ$← signature of $child$

14. **if** $minDis ≤ ɛ$ and $checkSig(δ, τ)$ **then**

15. $H.add(child)$

16. **end for**

17. **else for** each $data$ in $e$ **do**

18. $minDis$← minimum distance between $tp$ and $data$

19. $τ$← signature of $data$

20. **if** $minDis ≤ ɛ$ and $checkSig(δ, τ)$ **then**

21. $N.add(data)$

22. **return** $N$

An R-tree is a representative index for storing spatial data. Since DBSCAN also processes spatial data, an R-tree is used to store the data. The time complexity of a search on an R-tree is $O(log_mn)$, where $m$ is the maximum number of entries in one node and $n$ is the number of data objects. Therefore, the cost of an ɛ-neighborhood query is also $O(log_mn)$. However, since an R-tree has no structure that can represent the set of keywords of an object, the tree must be searched down to the leaf nodes when executing the ɛ-neighborhood query in order to access the keywords of each object. An IR^{2}-tree stores a keyword signature in every node, so subtrees whose signatures do not match the query can be pruned without visiting their leaf nodes.
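A toy sketch of this node-level pruning, using a best-first queue as in Algorithm 5; the two-level "tree", its dict-based node layout, and the point-based distance bound are hypothetical simplifications (a real IR^{2}-tree would use MBR distances and disk pages):

```python
import heapq
from math import dist

def check_sig(delta, tau):
    return delta == (delta & tau)

def min_dist(tp, node):
    """Lower bound on the distance from tp to anything under node
    (min over stored points; a real IR^2-tree would use the node's MBR)."""
    if "point" in node:
        return dist(tp, node["point"])
    return min(min_dist(tp, c) for c in node["children"])

def km_range_query(root, tp, eps, delta):
    """Best-first traversal: expand a node only if its distance lower bound
    is within eps AND its signature passes checkSig; otherwise prune it."""
    result, tie = [], 0
    heap = [(min_dist(tp, root), tie, root)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > eps or not check_sig(delta, node["sig"]):
            continue                      # prune the whole subtree
        if "point" in node:
            result.append(node["point"])
        else:
            for child in node["children"]:
                tie += 1
                heapq.heappush(heap, (min_dist(tp, child), tie, child))
    return result

leaf1 = {"sig": 0b011, "point": (1, 0)}
leaf2 = {"sig": 0b100, "point": (2, 0)}   # signature lacks the query bit
root = {"sig": 0b111, "children": [leaf1, leaf2]}
print(km_range_query(root, (0, 0), eps=3, delta=0b001))  # [(1, 0)]
```

Here `leaf2` is spatially within ɛ but is discarded by the signature check alone, which is the saving an IR^{2}-tree offers over a plain R-tree.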

Experiments

In this section, we evaluate the performance of our three proposed DBSCAN algorithms (NK-DBSCAN, IN-DBSCAN, and KM-DBSCAN) by varying several parameters. We used real and synthetic datasets in the experiments. We vary the cardinality $N$, the signature length $𝓁$, the number of query keywords $𝓀$, the radius of the neighborhood ɛ, and the minimum number of neighbor tuples minPtr. Table 3 lists the system parameters we used in our experiments; default values are shown in boldface. In each experiment, we varied a single parameter while the other parameters were set to their default values.

**Table 3.** System parameters

Parameter | Value |
---|---|

Cardinality of tuples ($N$) | 100K, 200K, 400K, 600K, 1M |

Signature length ($𝓁$) | 64, 128, 192, 256, 320, 384, 448, 512, 640, 768 |

Number of query keywords ($𝓀$) | 1, 2, 3, 4, 5 |

Radius of a neighborhood ($ɛ$) | 10, 20, 30, 40, 50 |

Minimum number of neighbor tuples ($minPtr$) | 3, 5, 10, 15, 20 |

Distribution (for spatial data vectors) | Uniform, Gaussian |

All the proposed algorithms were implemented in Java, and the experiments were conducted on a Linux (Ubuntu 14.04) platform using an Intel Core i7 at 3.50 GHz with 32 GB of memory. The page size of a node was the same for the IR^{2}-tree and the R-tree.

We used real and synthetic datasets. For the real dataset, we collected sets of items from data.world (https://data.world/datasets/geospatial). Each tuple has longitude and latitude coordinates as spatial data. The collected data was slightly modified for our experimental purposes. Because the real dataset was highly skewed, we normalized the spatial data into the space (0, 0) × (5K, 5K). In order to evaluate the effects of various parameters, we generated two types of synthetic datasets: one has spatial data with a uniform distribution in the space, and the other has spatial data with a Gaussian distribution. Our datasets have various cardinalities N in the range [100K, 1M] and signature lengths 𝓁 in the range [64, 768]. The spatial data vector space is (0, 0) × (5K, 5K). Each tuple is described by eight keywords.

In all the experiments, we focus on CPU time. For all the algorithms, we measure performance as elapsed time, i.e., the total execution time of the algorithm. However, the elapsed time for IN-DBSCAN additionally includes the time to build the R-tree over the keyword-matched tuples, since IN-DBSCAN constructs this index for each query (step 2 in Fig. 3).

Effect of Scalability

In this section, we evaluate scalability with respect to the cardinality N of the dataset. We conduct experiments with various cardinalities N in [100K, 1M], with 𝓁 = 768 and $𝓀$ = 2. Fig. 6 shows the experiments for varying $N$. As shown in Fig. 6, KM-DBSCAN generally outperforms NK-DBSCAN and IN-DBSCAN on all the datasets. The performance difference increases as the cardinality increases, which shows that the KM-DBSCAN algorithm performs effectively on large datasets. Specifically, the difference between KM-DBSCAN and the others on the real dataset is greater than the difference on the synthetic datasets. This is because KM-DBSCAN reduces the number of accessed nodes by means of the signatures of the IR^{2}-tree.

Effect of the Number of Query Keywords

In order to study the effect of the number of query keywords, we vary the number of query keywords 𝓀 within the range [1, 5] with N = 1M and 𝓁 = 768. In this experiment, KM-DBSCAN generally outperforms NK-DBSCAN and IN-DBSCAN on both synthetic and real datasets. Fig. 7 shows that the difference between KM-DBSCAN and the others decreases as the number of query keywords increases, especially on the synthetic datasets. This is because we searched for tuples containing all the given query keywords; with more query keywords, more entries matched the query signature and could not be pruned early.

Effect of Signature Length

In this experiment, we study the effect of the signature length in the IR^{2}-tree. In order to find the effect of the length of the signature, we vary the signature length 𝓁 within the range [64, 768] with N = 1M and 𝓀 = 2. Fig. 8(a) shows that the CPU time of KM-DBSCAN, IN-DBSCAN, and NK-DBSCAN decreases as the signature length increases. On the real data, when 512 ≤ 𝓁, there is little further decrease in CPU time. Fig. 8(b) shows that the number of nodes increases as the signature length increases. The uniform and Gaussian datasets have the same number of nodes regardless of their distribution.

Effect of minPts

In this section, we evaluate scalability with respect to minPts. We conduct experiments with various minPts values [2, 4, 8, 16, 32], with 𝓁 = 768 and 𝓀 = 2. Fig. 9 shows the experiments for varying minPts. As shown in Fig. 9, KM-DBSCAN generally outperforms NK-DBSCAN and IN-DBSCAN on all the datasets. NK-DBSCAN is less sensitive to minPts than the other algorithms: for NK-DBSCAN, minPts does not significantly affect performance, because increasing minPts does not reduce the work of finding the ɛ-neighborhood of a tuple. The performance of KM-DBSCAN and IN-DBSCAN improves as minPts increases. This is because the number of cells with core tuples decreases as minPts increases, thus reducing the work of the merging step.

Effect of ɛ

To confirm the effect of the size of ɛ, we vary ɛ from 10 to 50. The other parameters are given their baseline values (Table 3). Fig. 10 shows the effect of the size of ɛ. For each algorithm, the runtime is approximately proportional to ɛ. KM-DBSCAN outperforms IN-DBSCAN and NK-DBSCAN. Even as ɛ increases, the difference between the execution times of KM-DBSCAN and IN-DBSCAN remains almost constant. The execution time of NK-DBSCAN increases significantly as ɛ increases. This is because the number of MBRs included within ɛ grows with ɛ, and searching many MBRs in the R-tree greatly increases the execution time.

Conclusion

This paper defined the keyword-matched DBSCAN problem and proposed three algorithms to solve it. We first proposed NK-DBSCAN, which uses simple pruning on an R-tree. Then, we described IN-DBSCAN, which builds an inverted index over the keyword data. Finally, we presented KM-DBSCAN on an IR^{2}-tree that maintains additional keyword information of tuples as signatures. KM-DBSCAN retrieves keyword-matched tuples by using a nearest-neighbor search technique on an IR^{2}-tree. While traversing the IR^{2}-tree, KM-DBSCAN effectively prunes nodes that do not contain the query keywords by means of the signatures stored in IR^{2}-tree nodes. In order to evaluate the performance of our algorithms, we conducted experiments in various settings. The experimental results showed that KM-DBSCAN is effective in improving computational performance. In future work, we plan to extend our proposal to deal with not only uniform and Gaussian distributions but also skewed data distributions, and to consider an optimized keyword-matched DBSCAN algorithm in a parallel processing environment.

Acknowledgements

Not applicable.

Author’s Contributions

Conceptualization, BK, JJ. Funding acquisition, BK. Investigation and methodology, BK. Project administration, BK. Resources, JJ. Supervision, JJ. Writing of the original draft, BK. Writing of the review and editing, BK, JJ. Software, JJ. Validation, BK. Data Curation, BK. Visualization, BK, JJ.

Funding

This work was supported by the Dongguk University Research Fund of 2019 and by the Research Grant of Jeonju University in 2021. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (No. 2020R1F1A1077369) and by the Korea government (MSIT) (No. 2021R1F1A1049387).

Competing Interests

The authors declare that they have no competing interests.

Author Biography

**Name : Hong-Jun Jang**

**Affiliation :** Jeonju University

**Biography :** He received the B.Sc. and Ph.D. degrees in computer science education from Korea University, Seoul, South Korea. His research interests include data mining, machine learning, and data base.

**Name : Byoungwook Kim**

**Affiliation :** Dongguk University Gyeongju Campus

**Biography :** He received the B.Sc. and Ph.D. degrees in computer science education from Korea University, Seoul, South Korea. Since 2018, he has been with the Department of Computer Engineering, Dongguk University, Gyeongju, South Korea. He had been with the Creative Informatics and Computing Institute, Korea University, Seoul. He is currently a Research Fellow with the New Media Education and Development Center, Korea National Open University. His research interests include data mining, machine learning, intelligent tutoring systems, and e-learning systems.

References

[1] Q. Liu, M. Deng, Y. Shi, and J. Wang, “A density-based spatial clustering algorithm considering both spatial proximity and attribute similarity,” Computers & Geosciences, vol. 46, pp. 296-309, 2012.

[2] J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169-194, 1998.

[3] V. Estivill-Castro and I. Lee, “Multi-level clustering and its visualization for exploratory spatial analysis,” GeoInformatica, vol. 6, no. 2, pp. 123-152, 2002.

[4] Z. Li, Algorithmic Foundation of Multi-Scale Spatial Representation. Boca Raton, FL: CRC Press, 2006.

[5] H. J. Miller and J. Han, Geographic Data Mining and Knowledge Discovery. Boca Raton, FL: CRC Press, 2009.

[6] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996, pp. 226-231.

[7] H. J. Jang, K. S. Hyun, J. Chung, and S. Y. Jung, “Nearest base-neighbor search on spatial datasets,” Knowledge and Information Systems, vol. 62, no. 3, pp. 867-897, 2020.

[8] D. W. Choi and C. W. Chung, “Nearest neighborhood search in spatial databases,” in Proceedings of 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, 2015, pp. 699-710.

[9] D. Zhang, C. Y. Chan, and K. L. Tan, “Nearest group queries,” in Proceedings of the 25th International Conference on Scientific and Statistical Database Management, Baltimore, MD, 2013, pp. 1-12.

[10] H. J. Jang, B. Kim, J. Kim, and S. Y. Jung, “An efficient grid-based k-prototypes algorithm for sustainable decision-making on spatial objects,” Sustainability, vol. 10, no. 8, article no. 2614, 2018. https://doi.org/10.3390/su10082614

[11] H. P. Kriegel, P. Kroger, J. Sander, and A. Zimek, “Density‐based clustering,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 231-240, 2011.

[12] C. Dharni and M. Bnasal, “An improvement of DBSCAN Algorithm to analyze cluster for large datasets,” in Proceedings of 2013 IEEE international conference in MOOC, innovation and technology in education (MITE), Jaipur, India, 2013, pp. 42-46.

[13] D. Wu, J. Shi, and N. Mamoulis, “Density-based place clustering using geo-social network data,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 5, pp. 838-851, 2017.

[14] T. O'Reilly, “What is Web 2.0: design patterns and business models for the next generation of software,” 2007 [Online]. Available: https://mpra.ub.uni-muenchen.de/4578/1/MPRA_paper_4578.pdf.

[15] J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files for text indexing,” ACM Transactions on Database Systems (TODS), vol. 23, no. 4, pp. 453-490, 1998.

[16] I. De Felipe, V. Hristidis, and N. Rishe, “Keyword search on spatial databases,” in Proceedings of 2008 IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 2008, pp. 656-665.

[17] N. Roussopoulos, S. Kelley, and F. Vincent, “Nearest neighbor queries,” in Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, 1995, pp. 71-79.

[18] G. R. Hjaltason and H. Samet, “Distance browsing in spatial databases,” ACM Transactions on Database Systems (TODS), vol. 24, no. 2, pp. 265-318, 1999.

[19] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung, “Density-based place clustering in geo-social networks,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, 2014, pp. 99-110.

[20] J. Chen, H. He, J. Chen, S. Yu, and Z. Shi, “Fast density clustering algorithm for numerical data and categorical data,” Mathematical Problems in Engineering, vol. 2017, article no. 6393652, 2017. https://doi.org/10.1155/2017/6393652

[21] H. Song and J. G. Lee, “RP-DBSCAN: a superfast parallel DBSCAN algorithm based on random partitioning,” in Proceedings of the 2018 International Conference on Management of Data, Houston, TX, 2018, pp. 1173-1187.

[22] S. Hayat, A. Rextin, A. Idris, and M. Nasim, “Text and phone calls: user behaviour and dual-channel communication prediction,” Human-centric Computing and Information Sciences, vol. 10, article no. 11, 2020. https://doi.org/10.1186/s13673-020-00217-x

[23] M. Zhang, R. Fu, Y. Guo, L. Wang, P. Wang, and H. Deng, “Cyclist detection and tracking based on multi-layer laser scanner,” Human-centric Computing and Information Sciences, vol. 10, article no. 20, 2020. https://doi.org/10.1186/s13673-020-00225-x

[24] Y. Liao, “Hot spot analysis of tourist attractions based on stay point spatial clustering,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 750-759, 2020.

[25] G. Cong, C. S. Jensen, and D. Wu, “Efficient retrieval of the top-k most relevant spatial web objects,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 337-348, 2009.

[26] B. Yao, F. Li, M. Hadjieleftheriou, and K. Hou, “Approximate string search in spatial databases,” in Proceedings of 2010 IEEE 26th International Conference on Data Engineering (ICDE), Long Beach, CA, 2010, pp. 545-556.

[27] J. B. Rocha-Junior, O. Gkorgkas, S. Jonassen, and K. Norvag, “Efficient processing of top-k spatial keyword queries,” in Advances in Spatial and Temporal Databases. Heidelberg, Germany: Springer, 2011, pp. 205-222.

[28] H. Choi, H. Jung, K. Y. Lee, and Y. D. Chung, “Skyline queries on keyword-matched data,” Information Sciences, vol. 232, pp. 449-463, 2013.

[29] H. H. Bu, N. C. Kim, B. J. Yun, and S. H. Kim, “Content-based image retrieval using multi-resolution multi-direction filtering-based CLBP texture features and color autocorrelogram features,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 991-1000, 2020.

[30] W. B. Frankes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms. Upper Saddle River, NJ: Prentice Hall, 1992.

About this article

Cite this article

Hong-Jun Jang^{1} and Byoungwook Kim^{2,*}, KM-DBSCAN: Density-Based Clustering of Massive Spatial Data with Keywords, Human-centric Computing and Information Sciences, vol. 11, article no. 43, 2021. https://doi.org/10.22967/HCIS.2021.11.043

**Received** 8 June 2021 **Accepted** 4 September 2021 **Published** 30 November 2021
