MSSRM: A Multi-Embedding Based Self-Attention Spatio-temporal Recurrent Model for Human Mobility Prediction
• Shunjie Wen1,2, Xu Zhang1,2,*, Ruixu Cao1, Boming Li1, and Yan Li3

Human-centric Computing and Information Sciences volume 11, Article number: 37 (2021)
https://doi.org/10.22967/HCIS.2021.11.037

Abstract

Human mobility affects many aspects of an urban area, including spatial structure, temporal connectivity, and even the response to epidemics. Prediction of human mobility is of great significance for a wide spectrum of location-based applications. To enhance the spatio-temporal contexts between check-ins, we encode check-in locations as a graph and propose a multi-embedding based self-attention spatio-temporal recurrent model (MSSRM) for human mobility prediction. In this paper, we first obtain elaborate spatial and temporal embeddings from the directed weighted graph of spatio-temporal points and the frequency distribution of users’ visits. Subsequently, we adopt a long short-term memory layer to capture the long-term and short-term spatio-temporal dependencies and introduce a self-attention mechanism to distinguish each location in different contexts. Finally, we use a fully connected layer and incorporate user information to yield prediction results. Extensive experimental results on two real-world datasets demonstrate that our model outperforms the state-of-the-art models.

Keywords

Human Mobility, Network Representation Learning, LSTM, Attention

Introduction

Human mobility impacts many aspects of urban areas, from spatial structure to response to epidemics [1]. As the number of urban sensors increases, research on human mobility in urban computing becomes increasingly attractive. Phone positioning data [2, 3], subway network data [4], social network data [5], and wearable sensor data [6, 7] provide unprecedented continuous location data, which reveal human mobility patterns. Recently, the discovery of human mobility patterns has been a major challenge and a popular research topic in urban computing, which can be used for mobility behavior mining [8–10], location-based recommendations [5, 11–13], traffic flow or crowd prediction [14–16], and abnormal activity recognition [17]. Most location-based service providers can accurately recommend a point-of-interest (POI) and relevant advertisements based on a user’s current and historical geographic location and personal information. For example, Didi and Uber can provide high-quality ride-sharing services based on the discovery and understanding of human mobility, e.g., prediction of service requests and estimated time of arrival. A spatial trajectory is a sequence of GPS points (locations) arranged in chronological order, which contains rich spatio-temporal contextual information. Spatial and temporal contextual information plays a key role in understanding user movement behaviors and helps predict where a user will head next. Models of human mobility can be aimed at reproducing individual mobility patterns or general population flows [18]. In this paper, we focus on next location prediction, which reflects individual mobility patterns.
Several traditional methods [19, 20] predict human mobility with the Markov chain, which is a set of discrete random variables with the Markov property. These methods capture the regularity of human movement, and each state in the Markov chain transitions with a certain conditional probability. However, the transition process is memoryless, so the current state is affected only by the previous state. Recently, extensive works based on deep learning have been carried out on mobility prediction. Since trajectory data containing spatio-temporal information is essentially sequence data, the recurrent neural network (RNN) model [21], an excellent sequence modeling method, has been introduced into this field. Various studies have attempted to discover human mobility with RNN and its variants, such as the long short-term memory (LSTM) network [22] and the gated recurrent unit (GRU). However, the perception ability of LSTM degrades sharply, and some long-term dependencies are lost when the volume of trajectory data increases. Therefore, historical attention has been introduced to capture the long-term dependence in a longer sequence, which can adaptively calculate the weights of important spatio-temporal points in the user’s historical trajectory [23]. Moreover, trajectory sparseness is often considered in next location prediction, because location-based data with rich semantics are usually sparse due to device and sampling frequency limitations. To mitigate the effects of sparsity, the extraction of spatio-temporal context is particularly important. Some studies [12, 24] focus on temporal intervals and spatial distances to enrich the spatio-temporal contexts and better distinguish each location. However, these methods mainly consider the local information between spatio-temporal points in the trajectory sequence and cannot grasp more general and comprehensive spatio-temporal information.
In addition, to capture periodicity, some works [5, 23] map each timestamp into one of 48 time slots and encode the slots directly as temporal feature vectors, but ignore the influence of real-world timestamps.
To alleviate the problems mentioned above, we propose a multi-embedding based self-attention spatio-temporal recurrent model (MSSRM) for human mobility prediction from lengthy and sparse trajectories. In MSSRM, we design a multi-embedding encoding module to capture the transition regularities of human movements. Specifically, we encode the users’ trajectories into a directed weighted graph, which contains more general and comprehensive contextual information about the order and frequency of visits. Next, we acquire the temporal frequency distribution of visits from real-world timestamps, which reflects the periodicity and trend of location visits. This module not only alleviates the curse of dimensionality associated with large-scale movement, but also contains rich spatio-temporal contextual information. The MSSRM is capable of capturing long-term dependency, and it applies LSTM to well-designed spatial and temporal embeddings. Another key component in MSSRM is a self-attention mechanism, which distinguishes each location in different contextual environments to better understand the semantic motivations of users in different contexts.
In general, our contributions are summarized as below:

We propose a novel human mobility prediction framework MSSRM, which is able to capture the sequential spatial and temporal features and encode the trajectory into a graph structure.

We propose to learn a vector representation of time in the mobility prediction task to obtain the temporal embeddings, which can extract the temporal information in trajectories and incorporate it into the location embeddings.

We propose to classify locations via a self-attention mechanism with different contextual representations and incorporate more collective and comprehensive information with Node2Vec.

Extensive experiments on two real-life trajectory datasets show that the MSSRM model outperforms several state-of-the-art methods in prediction accuracy.

The rest of this paper is organized as follows. We first review related works in Section 2. Then, we formulate the problem and briefly introduce the concepts of our work in Section 3. Next, in Section 4, we describe details of the architecture of the MSSRM model. After presenting experimental evaluations and an extensive analysis of the performance in Section 5, we finally conclude our work in Section 6.

Related Work

Non-neural Network Methods
Most of the non-neural network methods are proposed based on the Markov chain, which means that the movement of the individual follows the state transition matrix. The NLPMM [25] considers both individual and collective movement patterns on sparse trajectories and suits them to different periods. Some works are based on hidden Markov model (HMM), and the hidden unknown parameters are determined by the observable ones. In addition, a density-based trajectory division algorithm is introduced, which helps to improve prediction efficiency. Liu et al. [26] extract the dwelling point based on the mined user trajectory and use HMM to predict the next location and propose suggestions.
Next location prediction reflects individual mobility patterns and is well developed in POI recommendation systems. The basic idea of a recommendation system is to discover internal association patterns. Matrix factorization and collaborative filtering are common methods, but they model users’ static preferences only. The PRME [27] models personalized check-in sequences, adopting multi-embeddings that integrate sequence information, personal preferences, and geographic influence. The Rank-GeoFM [28] adopts a ranking-based geographic decomposition method, which characterizes the check-in frequency as users’ visiting preference and learns the factorization by ranking the POIs correctly. Unfortunately, these non-neural network methods rarely consider rich contextual information and fail to handle long-term dependence.

Neural Network Methods
Recently, RNNs have been widely used due to their excellent performance in processing sequential data. The ST-RNN [21] uses a time-specific transition matrix and a distance-specific transition matrix based on the RNN unit to enhance the ability of the model to receive spatio-temporal contextual information. Regrettably, RNN appears to be ineffective when dealing with long-term dependence, so methods based on variants of RNN (LSTM and GRU), which are mainly used to prevent vanishing gradients, are constantly being proposed. Kong and Wu [8] use spatial and temporal embeddings to feed LSTM to predict the user’s next location. However, the contextual information of time and space is not fully utilized. The Time-LSTM [29] designs time gates to model time intervals. Other methods [12, 24] introduce time intervals and distance intervals to enhance the ability of context-awareness. However, it was gradually realized that both convolutional and recurrent neural networks essentially perform a local encoding of variable-length sequences. With the development of deep learning, the attention mechanism with dynamically generated weights has been proposed to deal with the long-term dependence of longer sequences. DeepMove [23] adopts an attention mechanism to capture the user’s mobility preferences in the historical and current trajectories. The CARA [30] proposes a contextual attention gate that controls the influence of the ordinary context on the users’ contextual preferences and leverages both sequences of feedback and contextual information associated with the sequences to capture the users’ dynamic preferences. The ATST-LSTM [31] uses an attention mechanism that can selectively focus on relevant historical check-in records in a check-in sequence based on spatio-temporal contextual information. Sun et al. [5] consider users’ long-term preferences and the geographical relations among POIs when modeling users’ short-term preferences.
Overall, most existing methods, involving fully-connected spatial and temporal embeddings, fail to consider the more general and comprehensive collective information of trajectories as a whole. Moreover, in terms of temporal modeling, those methods take the periodicity of user visits into account but usually omit temporal contexts such as the visit frequency.

Preliminaries

In this section, we first formally give problem formulations and term definitions, and then briefly introduce the principles of the recurrent neural network involved in the following.

Problem Formulation
Let $𝕌=\{u_1,u_2,⋯,u_U\}$ and $𝕃=\{l_1,l_2,⋯,l_L\}$ denote the sets of $U$ users and $L$ locations, respectively. Each location is geocoded by a (longitude, latitude) tuple, i.e., ($lng_l,lat_l$). Candidate locations can be real-world POIs or equally sized grid cells. In our work, we use grid cells to represent the locations.
Definition 1 (Trajectory Sequence). A raw check-in record can be represented as a tuple $c_k^{u_i}=(u_i,l_k,t_k)$, where $u_i$ denotes the user ID, and $l_k$ indicates the place that the user visited at the timestamp $t_k$. A trajectory sequence $S^u$ is a sequence of spatio-temporal points for user $u$ ordered by timestamp, which consists of a series of check-in records, i.e., $S^u=c_1^u c_2^u⋯c_n^u$.
Definition 2 (Trajectory). According to the fixed time window $t_w$, a user's trajectory sequence $S^u$ is divided into several subsequences which are defined as trajectories, i.e., $S^u=\{S_{t_{w_1}}^u,S_{t_{w_2}}^u,⋯,S_{t_{w_m}}^u\}$, where $m$ is the index of the current trajectory. Each trajectory $S_{t_w}^u$ contains the check-in records of $S^u$ that fall in the time window $t_w$, i.e., $S_{t_w}^u=c_i^u c_{i+1}^u⋯c_{i+k}^u$, where $t_j$ belongs to $t_w$ for all $i≤j≤i+k$. The time window $t_w$ is the time interval between two trajectories and it can be adjusted to 1 hour, 1 day, or other thresholds according to specific demands.
Problem 1 (Mobility Prediction). For a certain user $u∈𝕌$, given the current trajectory $S_m^u∈S^u$ and the location candidates $𝕃=\{l_1,l_2,⋯,l_L\}$, along with the user’s historical trajectories $\{S_1^u,S_2^u,⋯,S_{m-1}^u\}$, our goal is to find the top-$K$ preferable locations $l∈𝕃$ at the next timestamp.
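Definitions 1 and 2 can be illustrated with a minimal sketch. It assumes that check-ins are `(user, location, timestamp)` tuples with timestamps in seconds and that a time window is a fixed-length bucket; this is only one possible reading of the time window $t_w$, and the function name `split_into_trajectories` is hypothetical.

```python
from collections import defaultdict

def split_into_trajectories(checkins, window_seconds):
    """Group (user, location, timestamp) check-ins into per-user trajectories.

    Assumption: a time window is a fixed-length bucket, so a check-in with
    timestamp t falls into window index t // window_seconds.
    """
    per_user = defaultdict(list)
    for user, loc, ts in checkins:
        per_user[user].append((ts, loc))

    trajectories = {}
    for user, records in per_user.items():
        records.sort()  # chronological order gives the trajectory sequence S^u
        windows = defaultdict(list)
        for ts, loc in records:
            windows[ts // window_seconds].append((loc, ts))
        # Order the trajectories by window index: S^u_1, ..., S^u_m
        trajectories[user] = [windows[w] for w in sorted(windows)]
    return trajectories

# Toy data: user u1 has check-ins in the first and third one-hour windows.
checkins = [
    ("u1", "l3", 7200), ("u1", "l1", 100), ("u1", "l2", 1800),
    ("u1", "l1", 9000), ("u2", "l5", 50),
]
result = split_into_trajectories(checkins, window_seconds=3600)
```

In the mobility prediction problem, the last list in `result[user]` would play the role of the current trajectory $S_m^u$ and the preceding lists the historical trajectories.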

LSTM
LSTM is a classical variant of RNN that was first introduced in 1997 by Hochreiter and Schmidhuber [22]. It mainly aims to overcome vanishing and exploding gradients when training on long sequential data. In RNN, the information of the hidden layer at a given moment comes solely from the current input and the information of the previous moment, which does not provide long-term memory. Compared to RNN, LSTM can capture both long-term and short-term dependence in sequential data with multiple gates and the sigmoid function. Similar to learning the relationship between the words of a sentence in machine translation, LSTM is also widely used to extract the relationship among locations visited at different time intervals in sparse trajectories.
An LSTM cell at timestep n is formulated as follows:

$f_n=σ(W_f⋅[h_{n-1},x_n]+b_f)$ (1)

$i_n=σ(W_i⋅[h_{n-1},x_n]+b_i)$ (2)

$\tilde c_n=\tanh(W_c⋅[h_{n-1},x_n]+b_c)$ (3)

$c_n=f_n⊙c_{n-1}+i_n⊙\tilde c_n$ (4)

$o_n=σ(W_o⋅[h_{n-1},x_n]+b_o)$ (5)

$h_n=o_n⊙\tanh(c_n)$ (6)

where $x_n$ is the input at timestep $n$, $h_{n-1}$ is the state of the hidden layer at the previous timestep $n-1$, $W$ represents a learnable weight matrix, and $b$ is the bias of each gate. $f_n$, $i_n$, and $o_n$ are the forget gate, the input gate, and the output gate, respectively. σ indicates the sigmoid activation function, which maps values to between 0 and 1 and thereby controls discarding and retaining. tanh is the hyperbolic tangent function, which maps values to between -1 and 1. ⊙ represents the element-wise (Hadamard) product. Here, $c_n$ is the state of the cell, which is updated from two parts: one is the element-wise product of $f_n$ and $c_{n-1}$, and the other is the element-wise product of $i_n$ and $\tilde c_n$, where $\tilde c_n$ is the candidate state of the cell.
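As an illustration of Eqs. (1)–(6), the following is a minimal pure-Python sketch of a single LSTM step. The helper names (`affine`, `lstm_step`) and the layout of `params` are assumptions made for this example; in practice a framework implementation such as the LSTM layer of a deep learning library would be used instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W @ v + b for a list-of-rows matrix W and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + b_j for row, b_j in zip(W, b)]

def lstm_step(x_n, h_prev, c_prev, params):
    """One LSTM step following Eqs. (1)-(6).

    params maps "f", "i", "c", "o" to (W, b); each W has shape
    hidden x (hidden + input) and acts on the concatenation [h_{n-1}, x_n].
    """
    z = h_prev + x_n  # list concatenation builds [h_{n-1}, x_n]

    def pre(name):
        W, b = params[name]
        return affine(W, z, b)

    f = [sigmoid(a) for a in pre("f")]            # Eq. (1), forget gate
    i = [sigmoid(a) for a in pre("i")]            # Eq. (2), input gate
    c_tilde = [math.tanh(a) for a in pre("c")]    # Eq. (3), candidate state
    c = [f_j * c_j + i_j * ct_j                   # Eq. (4), cell state
         for f_j, c_j, i_j, ct_j in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(a) for a in pre("o")]            # Eq. (5), output gate
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]  # Eq. (6)
    return h, c

# Toy check with all-zero weights: every gate equals 0.5 and the candidate is 0,
# so the new cell state is half of the previous one.
hidden, n_in = 2, 1
zero_W = [[0.0] * (hidden + n_in) for _ in range(hidden)]
zero_b = [0.0] * hidden
params = {name: (zero_W, zero_b) for name in "fico"}
h_n, c_n = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```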

Methodology

In this section, the proposed approach is presented in detail. Fig. 1 depicts the overall architecture of MSSRM, which mainly consists of a multi-embedding encoding module, a recurrent module, an attention module, and a prediction layer.
In the first phase, at the collective level, we extract spatial and temporal features with rich context from all users’ trajectories. The second part demonstrates the recurrent module we use. Then the third part introduces the attention layer that enhances contextual features. Finally, in the prediction layer, we extract users’ preferences features at the individual level and obtain the prediction results.

Fig. 1. The overall framework of the MSSRM.

Multi-Embedding Encoding Module
Generally, it is crucial to understand not only how many individuals move from place to place but also how often they do so [1]. Indeed, as places attract individuals for reasons as diverse as work, shopping, or recreation, it is of great significance to extract temporal and spatial features from trajectories, which indicate rich contextual information. Usually, in the human mobility prediction tasks, traditional representations such as one-hot encoding are utilized to distinguish each candidate location. However, due to the curse of dimensionality, trajectory sparseness, and computational inefficiency, traditional representations tend to be gradually replaced by the embedding method. The embedding method associates each discrete candidate location with a low-dimensional dense vector and converts the identifier of each location into a latent space in a predefined size before capturing the long-term dependence. And the values of the dense vector can be updated during the training process. In particular, locations that often appear at the same time in the same trajectories share similar representations in latent space. Similarly, in these approaches, time is directly converted into a dense vector through the fully-connected embedding layer.

Spatial features extractor
Most of the existing next location prediction algorithms are based on distributed word representations. The features of each location are encoded into a latent representation vector, which omits many useful contexts and makes it difficult to distinguish locations from each other. A high-quality location embedding contains the potential interaction relationships of each location and more comprehensive location information, which can be more discriminative and can also be adjusted through backpropagation during training.
Inspired by Grover and Leskovec [32], we adopt a graph embedding method called Node2Vec to encode each candidate location in all trajectories to a high-quality embedding vector. For the trajectories of all users, we first construct a directed weighted graph $G = (V,E)$. The directed weighted graph carries richer information, which reflects collective visit order preference and frequency information. $V$ is the set of vertices of $G$, which corresponds to the set of candidate locations 𝕃 in this paper, i.e., $|V| = N$. $E$ is the set of edges of $G$, and we define the edge $e_{ij}∈E$ according to (7):

$e_{ij}= \cases{ 1, &\text{from } l_i \text{ to } l_j \cr 0, &\text{unachievable} }$ (7)

According to the transitions between the spatio-temporal points of all users, we get the edge set $E$, and $w_{ij}$ denotes the weight of the edge $e_{ij}$, which is calculated from the total number of visits. Then we apply Node2Vec to the constructed trajectory graph. The loss function of the spatial feature extractor is expressed as (8):

$L_{space}=\underset{f}{\max} \displaystyle\sum_{u∈V} \log \Pr(N_s(u)∣f(u))$ (8)

where $f(u)$ is the embedding of the current node $u$, and $N_s(u)$ denotes the set of neighboring nodes of $u$ sampled with strategy $s$. For the current node, the probability of observing its neighboring nodes is described as (9):

$Pr(N_s (u)∣f(u))= \displaystyle\prod_{n_i∈N_s (u)} Pr⁡(n_i∣f(u) )$(9)

where $\Pr(n_i∣f(u))$ can be calculated by (10):

$Pr(n_i∣f(u))=\frac{exp⁡(f(n_i)⋅f(u))} {∑_{v∈V} exp⁡(f(v)⋅f(u))}$(10)
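To make Eqs. (9) and (10) concrete, the sketch below computes $\log \Pr(N_s(u)∣f(u))$ for a toy embedding table. The function name, the dictionary layout, and the toy values are assumptions for illustration only; the softmax denominator runs over all vertices, exactly as written in (10).

```python
import math

def neighbor_log_prob(embeddings, u, neighbors):
    """Return log Pr(N_s(u) | f(u)) using the softmax of Eq. (10) and the
    factorization over sampled neighbors of Eq. (9)."""
    f_u = embeddings[u]
    scores = {v: sum(a * b for a, b in zip(f_v, f_u))
              for v, f_v in embeddings.items()}
    # log-sum-exp with max subtraction for numerical stability
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return sum(scores[n] - log_z for n in neighbors)

# Toy embeddings: with all-zero vectors every vertex is equally likely,
# so each neighbor has probability 1/4.
embeddings = {"a": [0.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 0.0], "d": [0.0, 0.0]}
uniform_value = neighbor_log_prob(embeddings, "a", ["b", "c"])

# A neighbor whose embedding is aligned with f(u) receives a higher probability.
embeddings2 = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [-2.0, 0.0]}
aligned = neighbor_log_prob(embeddings2, "a", ["b"])
opposed = neighbor_log_prob(embeddings2, "a", ["c"])
```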

From (9) and (10), the loss function of the spatial feature extractor is shown as (11):

$L_{space}=\underset{f}{\max} \displaystyle\sum_{u∈V} \Bigg[-\log Z_u + \displaystyle\sum_{n_i∈N_s(u)} f(n_i)⋅f(u)\Bigg]$ (11)

where $Z_u=∑_{v∈V} \exp(f(u)⋅f(v))$. Node2Vec combines breadth-first sampling (BFS) and depth-first sampling (DFS) to sample nodes in the graph $G$. The local microscopic view is extracted by BFS, and the global macroscopic view is obtained by DFS. The probability of moving from a node to each of its adjacent nodes is defined as (12):

$P(c_i=x∣c_{i-1}=v)= \cases{ \frac{π_{vx}}{Z}, &\text{if } (v,x)∈E\cr 0, &\text{otherwise} }$ (12)

where $π_{vx}$ is the unnormalized transition probability between vertices $v$ and $x$, and $Z$ is the normalization constant. For our weighted graph $G$, the transition probability is multiplied by the weight $w_{vx}$ between the two nodes in (13). Node2Vec introduces two hyperparameters $p$ and $q$ to control the 2nd-order random walks in (14):

$π_{vx}=α_{pq}(t,x)⋅w_{vx}$ (13)

$α_{pq}(t,x) = \cases{ \frac{1}{p}, &\text{if } d_{tx}=0\cr 1, &\text{if } d_{tx}=1\cr \frac{1}{q}, &\text{if } d_{tx}=2 }$ (14)

where $t$ is the previously visited vertex, $d_{tx}$ is the shortest-path distance between $t$ and the candidate vertex $x$, the return parameter $p$ controls the probability of revisiting the vertex just visited, and the in-out parameter $q$ controls whether the walk moves outward or inward. In this paper, we set $p=q=0.25$ and the walk length to 80. After sampling the vertices with the alias method, we obtain the embedding vector of each vertex in graph $G$ using Word2Vec, and the set of the embedding vectors of the candidate locations is shown in (15):

$B_l=\mathrm{Node2Vec}(V,E)=[b_{l_1}, b_{l_2}, ⋯,b_{l_L}],\; b_{l_i}∈R^{d_L}$ (15)

where $b_{l_i}$ denotes the $i^{th}$ $d_L$-dimensional spatial features embedding vector. The location embedding vectors are obtained by constructing a directed weighted graph of the trajectories, and we get more expressive representations that duly consider the interaction between each candidate location and its neighbors, and the richer context spatial information of them can be captured.
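The graph construction and the biased transition probabilities of Eqs. (12)–(14) can be sketched as follows. Two points are assumptions of this example rather than statements from the text: the edge weight $w_{ij}$ is taken to be the number of observed consecutive transitions from $l_i$ to $l_j$, and a candidate $x$ is treated as being at distance 1 from $t$ if an edge exists between them in either direction. The function names are hypothetical, and the sketch stops before the alias sampling and Word2Vec steps.

```python
from collections import Counter

def build_transition_graph(trajectories):
    """Directed weighted graph as a dict: weights[(l_i, l_j)] is the number of
    observed consecutive transitions l_i -> l_j over all trajectories."""
    weights = Counter()
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            weights[(src, dst)] += 1
    return weights

def walk_probabilities(weights, t, v, p, q):
    """Transition probabilities out of v, given the previous vertex t,
    following Eqs. (12)-(14)."""
    unnormalized = {}
    for (src, x), w in weights.items():
        if src != v:
            continue
        if x == t:                                     # d_tx = 0: return to t
            alpha = 1.0 / p
        elif (t, x) in weights or (x, t) in weights:   # d_tx = 1 (assumed symmetric test)
            alpha = 1.0
        else:                                          # d_tx = 2: move away from t
            alpha = 1.0 / q
        unnormalized[x] = alpha * w                    # Eq. (13)
    z = sum(unnormalized.values())
    if z == 0:
        return {}  # v has no outgoing edges
    return {x: pi / z for x, pi in unnormalized.items()}  # Eq. (12)

# Toy trajectories given as location sequences.
trajectories = [["a", "b", "c"], ["a", "b", "a"], ["a", "b", "d"], ["a", "d"]]
weights = build_transition_graph(trajectories)
probs = walk_probabilities(weights, t="a", v="b", p=0.25, q=0.25)
```

With $p=q=0.25$ as in the text, both returning to the previous vertex and moving outward are weighted four times as heavily as staying in the neighborhood of $t$.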

Temporal features extractor
The mobility prediction task utilizes the current trajectory and historical records to predict the location at the next timestamp. Existing research usually performs a series of pre-processing steps at the time level to extract temporal features for time series prediction or next location prediction problems, such as encoding temporal category features (days, weeks, months, and seasons) or manual feature engineering (e.g., the Fourier transform). In trajectory data, time is closely associated with a location whenever a visit occurs. The temporal pattern carries the user's visit preference at the time level. Human mobility patterns usually contain temporal regularity; hence, richer semantic information can be captured from the absolute visit time of each spatio-temporal point. In DeepMove, the time of the current trajectory and the historical trajectory is divided into 48 time slots. Although this can reflect a certain periodicity, it also omits part of the effective temporal contexts.
Inspired by Kazemi et al. [33], we obtain temporal embedding vectors using Time2Vec instead of a fully-connected layer to encode time. Firstly, we obtain the time distribution of location visits. Then we adapt Time2Vec to model time and obtain the representation of each absolute visit timestamp. For a given scalar notion of time τ, Time2Vec of τ is a vector of size k+1 defined as follows:

$b_{t_i}=\mathrm{Time2Vec}(τ)[i]=\cases{ ω_i τ+φ_i, &\text{if } i=0\cr F(ω_i τ-φ_i), &\text{if } 1≤i≤k. }$ (16)

$B_t=[b_{t_1}, b_{t_2},⋯,b_{t_T}],b_{t_i}∈R^{d_T}$(17)

where Time2Vec(τ)[i] is the $i^{th}$ element of Time2Vec(τ), F is a periodic activation function, and $ω_i$ and $φ_i$ are learnable parameters. In the original transformer proposed by Vaswani et al. [34], the model introduced position encoding to distinguish the order of the input sequence, which is an important feature that cannot be ignored for sequence prediction. In the position encoding module, they added sine and cosine functions to the vector representations as shown in (18), so that the resulting vectors carry information about the position of each item in the sequence. These sine and cosine functions are called positional encoding. Here, let F(⋅) be sin(⋅), so that Time2Vec can be considered as representing continuous time instead of discrete positions, which can capture periodic behaviors.

$Φ(o)=[\cos(ω_1 o),\sin(ω_1 o),⋯,\cos(ω_d o),\sin(ω_d o)], \text{ for } ω_k=1/10000^{2k/d}.$ (18)

In this paper, we set the time interval to 1 hour and obtain a series of time slots. Then, we let $k$ equal 9, so each temporal embedding vector has 10 dimensions. Each absolute timestamp is mapped into those time slots, and the high-quality temporal representations not only reflect the users’ order preference of visits but also contain the users’ periodic habits. Finally, the visit record of user u including time and location is denoted as $X^u∈R^{n×(d_L+d_T)}$, presented as (19):

$X^u=Concat(B_l^u ,B_t^u )$(19)

where spatial feature $B_l^u ∈R^{n×d_L}$ and temporal feature $B_t^u∈R^{n×d_T}$ concatenate as the input of the recurrent module; n denotes the length of the trajectory.
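The temporal embedding of Eq. (16) with $F=\sin$ and the concatenation of Eq. (19) can be sketched as below. In the model, $ω_i$ and $φ_i$ are learned; here they are fixed to hypothetical values (a 24-hour period) purely for illustration, and the 4-dimensional location embedding is a toy stand-in for the spatial embedding. The phase sign follows (16) as printed.

```python
import math

def time2vec(tau, omega, phi):
    """Time2Vec of a scalar time tau with F = sin, following Eq. (16) as printed.

    omega and phi are lists of k + 1 parameters (learnable in the model,
    fixed here for illustration).
    """
    linear = omega[0] * tau + phi[0]                       # i = 0: non-periodic term
    periodic = [math.sin(w * tau - p)                      # 1 <= i <= k: periodic terms
                for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic

def hour_slot(timestamp_seconds):
    """Map an absolute timestamp in seconds to a 1-hour time slot index."""
    return timestamp_seconds // 3600

k = 9  # gives a temporal embedding of k + 1 = 10 dimensions, as in the text
omega = [1.0] + [2 * math.pi / 24] * k   # hypothetical values: 24-hour period
phi = [0.0] * (k + 1)

b_t = time2vec(hour_slot(6 * 3600 + 120), omega, phi)  # a check-in shortly after 06:00
b_l = [0.1, -0.3, 0.7, 0.2]                            # toy location embedding
x = b_l + b_t                                          # one row of X^u, Eq. (19)
```

Because the periodic components repeat every 24 slots with these toy parameters, a check-in at slot 30 yields the same periodic part as one at slot 6, while the linear component still distinguishes the two absolute times.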

Recurrent Module and Context-Aware Module

LSTM
The next module consists of an LSTM layer, which is a more complex type of RNN whose repeating unit includes several gated components with different functions. At every step, each unit receives data from two sources: the current vector of the data sequence concatenated with the output vector of the unit at the previous step. The features flow through the LSTM units, are encoded in the cell state, and are updated by the gates until the end of the sequence is reached. To address the long-term dependence, we apply LSTM on the spatio-temporal embeddings as below:

$H^u=[h_1,h_2,⋯,h_n ]=LSTM([x_1,x_2,⋯,x_n]) =LSTM(X^u)$ (20)

where $H^u∈R^{n×d}$ represents the output state of the LSTM module for user u; d denotes the hidden size.

When people perceive an item, they will first pay attention to a specific part based on their needs, instead of processing the whole picture at once. In general, mobility prediction is essentially a chronological sequence task. Inspired by Vaswani et al. [34], we introduce a multi-head self-attention mechanism, which was initially adopted in machine translation tasks, to process longer spatio-temporal point sequences containing historical data and to understand the internal interaction among original states from the LSTM output. Compared to the basic attention mechanism, the query in the self-attention mechanism is input data itself. And multi-head self-attention mechanism can generate multiple different weight matrices and discriminate locations in different contextual environments.
To get the query, key and value for user u, the output of the LSTM module is projected to three parts as shown in (21)–(23):

$Q_i=H^u W_i^q$(21)

$K_i=H^u W_i^k$(22)

$V_i=H^u W_i^v$(23)

where $W_i^q∈R^{d×d_q}, W_i^k∈R^{d×d_k}, W_i^v∈R^{d×d_v}$ are the query, key, and value weight matrices for the $i^{th}$ head, respectively. Among them, $d_q=d_k=d_v$ and $d=d_H$, which is the dimension of the hidden units of the recurrent module. Then, we apply scaled dot-product attention to obtain the “intimacy” of each location with the others as shown in the following:

$Att(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ (24)

$head_i = Att(Q_i,K_i,V_i)$(25)

$A^u=MultiHead(Q,K,V)=Concat(head_1,head_2,⋯,head_h ) W^o$(26)

where $W^o∈R^{d×d}$ is the output projection matrix, and the outputs of the heads are concatenated before projection. We set the number of heads $h$ to 8 in this paper. Then we obtain the results of the multi-head self-attention mechanism, which capture the collective mobility pattern among users and can distinguish each location under different contextual information.
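Eqs. (21)–(26) can be sketched in pure Python as follows. This is an illustrative sketch with tiny dimensions and hand-picked weights, not the trained model; the helper names (`matmul`, `attention`, `multi_head`) are assumptions of this example.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (24). Returns (output, weights)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def multi_head(H, heads, W_o):
    """Eqs. (21)-(26): heads is a list of (W_q, W_k, W_v) triples, one per head."""
    outputs = [attention(matmul(H, W_q), matmul(H, W_k), matmul(H, W_v))[0]
               for W_q, W_k, W_v in heads]
    # Concatenate the head outputs row by row, then apply the output projection.
    concat = [sum((out[r] for out in outputs), []) for r in range(len(H))]
    return matmul(concat, W_o)

# Toy LSTM output with n = 3 steps and hidden size d = 2, and two heads with d_k = 1.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_q = [[0.0], [0.0]]                      # zero queries give uniform attention weights
heads = [
    (zero_q, [[1.0], [0.0]], [[1.0], [0.0]]),
    (zero_q, [[0.0], [1.0]], [[0.0], [1.0]]),
]
W_o = [[1.0, 0.0], [0.0, 1.0]]
A = multi_head(H, heads, W_o)
```

With zero query weights every step attends uniformly, so each output row is simply the mean of the value vectors; learned, non-zero queries would instead weight the steps differently per head.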

Prediction
After measuring the importance of different locations through the multi-head self-attention mechanism, the final location preference representation $A^u$ is obtained. Then the probability distribution $y^u$ over the $L$ candidate locations is computed as follows:

$O^u = A^u W^o+b_o$(27)

$y^u=\mathrm{softmax}(O^u+uW^u)$ (28)

where $W^o∈R^{d×L}$ and $W^u∈R^{U×L}$ are trainable weight matrices ($U$ users and $L$ locations), and $b_o$ is the bias parameter of the fully-connected layer. Then, we apply cross-entropy loss on $y^u$ and the objective function $J$ can be formulated via (29):

$J=- \displaystyle\sum_{u∈U} \displaystyle\sum_{i=1}^L l_i^u log(y_i^u) + λ ‖Θ‖_2$(29)

where $l_i^u$ is the ground truth of each location for user u, λ is the parameter of $L2$ regularization, and Θ denotes the parameters to be regularized in the model.
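A minimal sketch of the prediction layer and objective in Eqs. (27)–(29) is given below. Several points are assumptions of this example: the attention output is reduced to a single $d$-dimensional vector per prediction, the user term $uW^u$ is read as selecting the row of $W^u$ that belongs to the user (that is, $u$ is treated as a one-hot vector), and the regularizer is the unsquared L2 norm as printed in (29).

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_distribution(a_u, W_o, b_o, user_row):
    """Eqs. (27)-(28): a_u has length d, W_o is d x L, b_o has length L, and
    user_row is the row of W^u for this user (one-hot u times W^u)."""
    n_loc = len(b_o)
    logits = [sum(a * W_o[j][l] for j, a in enumerate(a_u)) + b_o[l] + user_row[l]
              for l in range(n_loc)]
    return softmax(logits)

def objective(probs, target_indices, theta, lam):
    """Eq. (29) with one-hot ground truth: cross-entropy plus lam * ||theta||_2."""
    cross_entropy = -sum(math.log(p[t]) for p, t in zip(probs, target_indices))
    l2_norm = math.sqrt(sum(w * w for w in theta))
    return cross_entropy + lam * l2_norm

# Toy example with d = 2 and L = 3 candidate locations.
W_o = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_o = [0.0, 0.0, 0.0]
y_uniform = predict_distribution([0.0, 0.0], W_o, b_o, [0.0, 0.0, 0.0])
y_biased = predict_distribution([0.0, 0.0], W_o, b_o, [1.0, 0.0, 0.0])
loss = objective([y_uniform], [0], theta=[3.0, 4.0], lam=0.1)
```

The second call shows how the user-specific term shifts probability mass toward locations that a particular user prefers, even when the sequence representation is uninformative.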

Experiments

In this section, we conduct experimental evaluations on two real-world datasets and compare the proposed MSSRM model with state-of-the-art methods.

Datasets
We conduct experiments on two public real-life location-based social network (LBSN) check-in datasets from Foursquare [35]. The Foursquare check-in datasets contain check-ins in New York (NYC) and Tokyo (TKY) collected for about 10 months (from April 12, 2012 to February 16, 2013). Among them, the NYC dataset contains 227,428 check-in records, while the TKY dataset contains 573,703 check-ins. Each check-in record in these two datasets is associated with its timestamp, GPS coordinates, and venue category. Following [25], we construct virtual grids for each of the two cities and map the coordinates of each check-in record to the corresponding grid cell; for example, the NYC dataset is mapped onto a 100 × 100 grid. Similarly, we remove unpopular locations with fewer than 10 user visits and eliminate users with fewer than 50 check-in records. We truncate trajectories with a limit of 10 hours, and inactive users with fewer than 3 complete trajectories are also filtered out. More details of these two datasets after preprocessing are shown in Table 1.

Table 1. Statistics of the evaluation datasets
| Datasets | NYC | TKY |
|---|---|---|
| City | New York, United States | Tokyo, Japan |
| Duration | 12.04.2012–16.02.2013 | 12.04.2012–16.02.2013 |
| Users (raw) | 1,083 | 2,293 |
| Records (raw) | 227,428 | 573,703 |
| Locations (raw) | 38,333 | 61,858 |
| Users (processed) | 518 | 1,467 |
| Locations (processed) | 2,357 | 3,034 |
| Trajectories (processed) | 5,770 | 19,224 |
| Longest trajectory | 96 | 68 |
| Trajectories/User | 11.14 | 13.1 |
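The grid mapping and filtering steps described for the datasets can be sketched as follows. The bounding box, the function names, and the interpretation of "user visits" as the total number of check-ins at a location are assumptions of this example; the thresholds default to the values given in the text.

```python
from collections import Counter

def to_grid_id(lng, lat, bbox, n_cells=100):
    """Map a coordinate to a cell index in an n_cells x n_cells grid.

    bbox = (min_lng, min_lat, max_lng, max_lat) is an assumed city bounding box.
    """
    min_lng, min_lat, max_lng, max_lat = bbox
    col = min(int((lng - min_lng) / (max_lng - min_lng) * n_cells), n_cells - 1)
    row = min(int((lat - min_lat) / (max_lat - min_lat) * n_cells), n_cells - 1)
    return row * n_cells + col

def filter_checkins(checkins, min_loc_visits=10, min_user_checkins=50):
    """Drop unpopular locations first, then users with too few remaining check-ins.

    checkins is a list of (user, location, timestamp) tuples.
    """
    loc_counts = Counter(loc for _, loc, _ in checkins)
    kept = [c for c in checkins if loc_counts[c[1]] >= min_loc_visits]
    user_counts = Counter(user for user, _, _ in kept)
    return [c for c in kept if user_counts[c[0]] >= min_user_checkins]

# Toy usage with a hypothetical 10 x 10 degree bounding box and small thresholds.
bbox = (0.0, 0.0, 10.0, 10.0)
cell = to_grid_id(5.05, 2.5, bbox)
toy = [("u1", "a", 1), ("u1", "a", 2), ("u1", "b", 3), ("u2", "a", 4)]
filtered = filter_checkins(toy, min_loc_visits=2, min_user_checkins=2)
```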

Baseline Methods
We compare our proposed MSSRM with six peer methods for human mobility prediction:

Markov [19]: This method is based on observations of the user's mobility behavior over time and the recent locations that one has visited, generating a transition matrix that is composed of the first-order transition probabilities.

LSTM [22]: This is a variant of the RNN model, which can effectively handle long-term dependence and is widely used to process sequential data.

DeepMove [23]: This method learns the long-term regularity of human mobility by introducing the historical attention module and models the historical preference and complicated sequential information of the current trajectory using an RNN module.

STGN [12]: This method considers the spatio-temporal intervals between neighbor check-ins to model temporal and spatial contexts by modifying the LSTM cells with coupled gates.

SASRM* [24]: The SASRM without semantic information. This method designs a variant LSTM cell with only a time gate and a distance gate to capture the main spatio-temporal dependencies and a flexible attention layer to capture mobility regularity from historical trajectories.

LSTPM [5]: This method considers users’ long-term preference and the geographical relations among recently visited locations, and it designs a nonlocal network for long-term preference modeling and a geo-dilated RNN for short-term preference learning.
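As a point of reference for the simplest baseline above, a first-order Markov predictor can be sketched as follows. This is a generic illustration of first-order transition probabilities, not the exact implementation of [19]; the function names and the alphabetical tie-breaking rule are assumptions.

```python
from collections import Counter, defaultdict

def fit_markov(sequences):
    """Estimate first-order transition probabilities from location sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def predict_top_k(transitions, current, k):
    """Return the k most probable next locations (ties broken alphabetically)."""
    probs = transitions.get(current, {})
    return sorted(probs, key=lambda loc: (-probs[loc], loc))[:k]

transitions = fit_markov([["a", "b", "a", "b", "c"]])
top = predict_top_k(transitions, "b", k=1)
```

The prediction depends only on the current location, which is the memoryless behavior that the sequence models in the other baselines are designed to overcome.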

Implementation Details and Evaluation Criteria
Following [24], we use the first 80% of each user’s trajectories as the training set and the rest for testing. We implement our model with PyTorch. All experiments are conducted on an NVIDIA 2080Ti GPU and an Intel Xeon W-2133 CPU with 64 GB of memory on the Ubuntu system. We choose the Adam optimizer [36] and backpropagation through time (BPTT) to train the model by minimizing the objective function. Dropout, weight decay, and the ReduceLROnPlateau strategy are adopted to prevent model overfitting. We clip the L2 norm of the gradients to alleviate the problem of gradient explosion. We set the L2 parameter λ=1e-6, the dropout rate to 0.5, and the initial learning rate to 1e-3. The dimensions of the location embedding and the temporal embedding are set to 500 and 10, respectively. The hidden size is d=512, the decay of the learning rate is 0.1, and the gradient clip is 5.0.
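The reported settings and the chronological split can be collected in a small sketch. The dictionary keys and the function name are hypothetical, and the split assumes that each user's trajectories are already in chronological order.

```python
# Hyperparameters as reported in the text; the key names are illustrative.
config = {
    "l2_lambda": 1e-6,
    "dropout": 0.5,
    "learning_rate": 1e-3,
    "lr_decay_factor": 0.1,
    "grad_clip_norm": 5.0,
    "location_embedding_dim": 500,
    "temporal_embedding_dim": 10,
    "hidden_size": 512,
    "train_ratio": 0.8,
}

def chronological_split(trajectories, train_ratio):
    """Use the first train_ratio share of a user's trajectories for training
    and the remainder for testing, without shuffling."""
    cut = int(len(trajectories) * train_ratio)
    return trajectories[:cut], trajectories[cut:]

user_trajectories = [f"S_{i}" for i in range(1, 11)]  # placeholder trajectory labels
train, test = chronological_split(user_trajectories, config["train_ratio"])
```

In PyTorch, these values would typically map onto `torch.optim.Adam` (learning rate), `torch.optim.lr_scheduler.ReduceLROnPlateau` (`factor=0.1`), and `torch.nn.utils.clip_grad_norm_` (`max_norm=5.0`); passing λ as the optimizer's `weight_decay` is one possible realization and is an assumption here.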
To evaluate the performance of each method for next location prediction, we follow the previous works [22, 37] and adopt two evaluation metrics: Recall@K and Normalized Discounted Cumulative Gain (NDCG@K). Recall@K mainly reflects the presence of the correct location among the top-K prediction results, while NDCG@K mainly reflects the quality of the top-K ranking list. In our work, we set the popular K={1,5,10,15} for evaluation; the higher the scores, the better the model performance.
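Both metrics can be sketched for the common next-location setting in which each test case has exactly one ground-truth location; that single-target assumption, under which the ideal DCG equals 1, and the function names are assumptions of this example.

```python
import math

def recall_at_k(ranked, target, k):
    """1.0 if the ground-truth location is among the top-k predictions, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@k for a single relevant item: 1 / log2(rank + 1) if it is in the top k."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-based position in the ranking list
    return 1.0 / math.log2(rank + 1)

def evaluate(rankings, targets, ks=(1, 5, 10, 15)):
    """Average Recall@k and NDCG@k over all test cases for each k."""
    n = len(targets)
    return {
        k: (
            sum(recall_at_k(r, t, k) for r, t in zip(rankings, targets)) / n,
            sum(ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)) / n,
        )
        for k in ks
    }

rankings = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["b", "c"]
scores = evaluate(rankings, targets, ks=(1, 5))
```

Recall@K only checks whether the target appears in the list, whereas NDCG@K also rewards placing it near the top, which is why the two metrics can rank models differently.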

Performance Evaluation
Table 2 shows the prediction performance of MSSRM and the baselines on the two datasets. In each column, the best result is highlighted in bold and the second-best is underlined. Our model outperforms all baselines with a 1.69%–5.76% improvement in Recall@K and NDCG@K. Because the other baselines do not use semantic information, we evaluate SASRM without it (denoted SASRM*), following the ablation setting in its paper. From the statistics, we draw the following observations:
Among the baseline models, all deep-learning-based models perform better than the model based on Markov chains. Although a Markov chain can generate a transition matrix from the user’s visit sequence, it depends only on the most recently visited locations and ignores temporal patterns. In contrast, the LSTM model, which introduces temporal embedding, can capture long-term dependencies through its LSTM units.
However, the LSTM model rarely considers historical information, so it tends to perform poorly on longer trajectories and loses the long-term periodicity in the historical trajectory. By capturing the user’s periodical patterns from historical trajectories, DeepMove and SASRM perform better than the LSTM model. However, the historical attention used in SASRM focuses on short-term information in its paper (dr=3), so its performance is not as strong. In particular, the improvement in NDCG scores suggests that DeepMove’s better prediction quality over multiple locations is mainly attributable to modeling users’ long-term and short-term preferences separately.
Similarly, LSTPM performs the best among the baselines on both datasets because it also considers long-term and short-term preferences. In addition, LSTPM introduces temporal and spatial contexts, which is the key factor that allows it to outperform DeepMove. However, for spatial modeling, all the methods above operate on sequences and do not observe the structure of the entire trajectory. They all apply fully connected embeddings to location and time, so the information on the order and frequency of visits is not fully utilized. Exploiting that information is what ultimately improves the performance of the MSSRM model.
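To make the limitation of the Markov baseline concrete, the following is a minimal sketch of a first-order Markov predictor. It is an illustrative simplification, not the exact baseline configuration used in the experiments, and the function names and toy sequences are hypothetical. The prediction depends only on the last visited location, so visit times and longer history play no role.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def fit_transitions(sequences: Sequence[Sequence[Hashable]]) -> Dict[Hashable, Counter]:
    """Count location-to-location transitions over all visit sequences."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts: Dict[Hashable, Counter], last_location: Hashable, k: int = 5) -> List:
    """Return the k most frequent successors of the last visited location."""
    successors = counts.get(last_location, Counter())
    return [loc for loc, _ in successors.most_common(k)]


# Toy visit sequences (made up).
visits = [
    ["home", "work", "gym", "home", "work", "cafe"],
    ["home", "work", "gym"],
]
transitions = fit_transitions(visits)
top2 = predict_next(transitions, "work", k=2)
```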

Table 2. Performance evaluation on two datasets w.r.t. Recall@K and NDCG@K
| Dataset | Method | Recall@1 | Recall@5 | Recall@10 | Recall@15 | NDCG@1 | NDCG@5 | NDCG@10 | NDCG@15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NYC | Markov | 0.1304 | 0.2606 | 0.3228 | 0.3522 | 0.1304 | 0.1992 | 0.2192 | 0.227 |
| NYC | LSTM | 0.1677 | 0.3761 | 0.466 | 0.5074 | 0.1677 | 0.2757 | 0.3049 | 0.3158 |
| NYC | DeepMove | 0.1988 | 0.4682 | 0.5644 | 0.6142 | 0.1988 | 0.3412 | 0.3724 | 0.3856 |
| NYC | STGN | 0.1699 | 0.3614 | 0.4433 | 0.488 | 0.1699 | 0.2703 | 0.2968 | 0.3166 |
| NYC | SASRM* | 0.1811 | 0.4014 | 0.4916 | 0.5354 | 0.1811 | 0.2965 | 0.3259 | 0.3376 |
| NYC | LSTPM | 0.2228 | 0.4821 | 0.5776 | 0.6319 | 0.2228 | 0.3586 | 0.3896 | 0.404 |
| NYC | MSSRM | 0.2266 | 0.5017 | 0.6015 | 0.6576 | 0.2266 | 0.3706 | 0.403 | 0.4178 |
| NYC | Improvement (%) | 1.69 | 4.06 | 4.14 | 4.07 | 1.69 | 3.35 | 3.45 | 3.44 |
| TKY | Markov | 0.1222 | 0.2528 | 0.3004 | 0.3367 | 0.1222 | 0.1912 | 0.2067 | 0.2163 |
| TKY | LSTM | 0.1813 | 0.4515 | 0.5461 | 0.5956 | 0.1813 | 0.3234 | 0.3541 | 0.3673 |
| TKY | DeepMove | 0.211 | 0.4768 | 0.5654 | 0.6114 | 0.211 | 0.3514 | 0.3802 | 0.3924 |
| TKY | STGN | 0.1786 | 0.4349 | 0.5306 | 0.5818 | 0.1786 | 0.3128 | 0.3438 | 0.3574 |
| TKY | SASRM* | 0.1945 | 0.4657 | 0.5679 | 0.6206 | 0.1945 | 0.3363 | 0.3697 | 0.3836 |
| TKY | LSTPM | 0.2111 | 0.5178 | 0.6204 | 0.6726 | 0.2111 | 0.3717 | 0.4051 | 0.4189 |
| TKY | MSSRM | 0.2241 | 0.53 | 0.6425 | 0.6999 | 0.2241 | 0.3844 | 0.421 | 0.4362 |
| TKY | Improvement (%) | 5.76 | 2.31 | 3.45 | 3.9 | 5.76 | 3.31 | 3.78 | 3.95 |

Ablation Analysis
To verify the contribution of the components we employ in our model, we further employ two simplified variants of MSSRM to conduct ablation tests:

MSSRM-A: Compared to the LSTM model, this variant integrates spatio-temporal embeddings from the spatial feature module and temporal feature module, but removes the user embedding and self-attention mechanism.

MSSRM-U: This variant removes only the user embedding while keeping the spatio-temporal embedding and self-attention mechanism.

The results of the degraded versions of MSSRM on two datasets are shown in Table 3. Through the ablation tests, it can be seen that:

With well-designed location and time embeddings, MSSRM-A always performs better than the LSTM model. This demonstrates the effectiveness of the embedding method we substituted. Considering the overall trajectory structure and sharing the spatial context information among all users greatly improves the prediction performance.

Building on MSSRM-A, MSSRM-U adds a multi-head self-attention module and significantly improves the prediction results. This demonstrates that learning representations of locations in different contextual environments and understanding the internal connections between locations in longer trajectories are very helpful for human mobility prediction.

Adding the user’s information as an embedding in the prediction stage helps distinguish the preferences of different users. This slightly improves the prediction performance.
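To illustrate how self-attention lets each position in a trajectory weigh every other position, the sketch below implements plain scaled dot-product self-attention in pure Python. It is a deliberate simplification and not the MSSRM module itself: it uses a single head, no learned query/key/value projections (the inputs serve as all three), and no masking. The input vectors are made up.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product self-attention with queries = keys = values = states."""
    d = len(states[0])
    outputs, weights = [], []
    for query in states:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in states]
        w = softmax(scores)
        # Context vector: attention-weighted average of all positions.
        context = [sum(w_j * value[i] for w_j, value in zip(w, states)) for i in range(d)]
        outputs.append(context)
        weights.append(w)
    return outputs, weights


# Toy hidden states (made up): positions 0 and 2 are identical, position 1 differs.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
contexts, attn = self_attention(hidden)
```

In this toy input, the first position attends more strongly to the identical third position than to the differing second one, which is the behavior that allows the same location to receive different representations in different contexts.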

Table 3. Performance of different MSSRM variants
| Dataset | Method | Recall@1 | Recall@5 | Recall@10 | Recall@15 | NDCG@1 | NDCG@5 | NDCG@10 | NDCG@15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NYC | LSTM | 0.1677 | 0.3761 | 0.466 | 0.5074 | 0.1677 | 0.2757 | 0.3049 | 0.3158 |
| NYC | Best Baseline | 0.2251 | 0.4877 | 0.587 | 0.6373 | 0.2251 | 0.363 | 0.3951 | 0.4084 |
| NYC | MSSRM-A | 0.1864 | 0.3937 | 0.4962 | 0.5535 | 0.1864 | 0.2954 | 0.3285 | 0.3438 |
| NYC | MSSRM-U | 0.2217 | 0.4848 | 0.5993 | 0.6462 | 0.2217 | 0.3589 | 0.3961 | 0.4085 |
| NYC | MSSRM | 0.2266 | 0.5017 | 0.6015 | 0.6576 | 0.2266 | 0.3706 | 0.403 | 0.4178 |
| TKY | LSTM | 0.1813 | 0.4515 | 0.5461 | 0.5956 | 0.1813 | 0.3234 | 0.3541 | 0.3673 |
| TKY | Best Baseline | 0.2111 | 0.5178 | 0.6204 | 0.6726 | 0.2111 | 0.3717 | 0.4051 | 0.4189 |
| TKY | MSSRM-A | 0.1945 | 0.4702 | 0.5745 | 0.6277 | 0.1945 | 0.34 | 0.3739 | 0.388 |
| TKY | MSSRM-U | 0.2214 | 0.5233 | 0.6323 | 0.6857 | 0.2214 | 0.3792 | 0.4146 | 0.4288 |
| TKY | MSSRM | 0.2241 | 0.53 | 0.6425 | 0.6999 | 0.2241 | 0.3844 | 0.421 | 0.4362 |

Impact of Parameters
In general, the number of units in the LSTM hidden layer influences both the training time and the experimental results to varying extents. We run the MSSRM model on the NYC and TKY benchmark datasets with different hidden sizes; the effect on accuracy is shown in Table 4, which indicates that the model is stable under hyperparameter tuning. We use d to denote the hidden size and set it to 256, 384, 512, and 640 for evaluation. As Table 4 shows, on the NYC dataset almost all scores improve as d grows (only Recall@15 dips slightly at d=640). On the TKY dataset, however, all scores fluctuate within a small range and show no obvious trend. A larger hidden size means more trainable parameters, which may yield better performance on some datasets, or may significantly increase training time without a significant performance improvement.
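The growth in trainable parameters can be estimated with a short calculation. The sketch below counts the parameters of a single-layer LSTM using the PyTorch nn.LSTM layout (four gates, each with an input weight matrix, a recurrent weight matrix, and two bias vectors). The input size of 510 is an assumption made for illustration: it corresponds to concatenating the 500-dimensional location embedding with the 10-dimensional temporal embedding, which may differ from how the model actually combines them.

```python
def lstm_param_count(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer in the PyTorch layout.

    Each of the 4 gates has:
      - an input weight matrix  (hidden_size x input_size)
      - a recurrent weight matrix (hidden_size x hidden_size)
      - two bias vectors (hidden_size each)
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return 4 * per_gate


# Assumed input size: 500 (location embedding) + 10 (temporal embedding).
INPUT_SIZE = 510

param_counts = {d: lstm_param_count(INPUT_SIZE, d) for d in (256, 384, 512, 640)}
for d, count in param_counts.items():
    print(f"d={d}: {count:,} LSTM parameters")
```

Because of the hidden-to-hidden term, the count grows faster than linearly in d, which is consistent with the observation that enlarging the hidden layer raises training cost even when accuracy does not improve.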

Table 4. The influence of LSTM hidden size w.r.t. Recall@K and NDCG@K
| Dataset | Hidden size | Recall@1 | Recall@5 | Recall@10 | Recall@15 | NDCG@1 | NDCG@5 | NDCG@10 | NDCG@15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NYC | 256 | 0.2104 | 0.4669 | 0.5627 | 0.6172 | 0.2104 | 0.3445 | 0.3758 | 0.3902 |
| NYC | 384 | 0.2157 | 0.4842 | 0.5812 | 0.6313 | 0.2157 | 0.3563 | 0.3879 | 0.4011 |
| NYC | 512 | 0.2266 | 0.5017 | 0.6015 | 0.6576 | 0.2266 | 0.3706 | 0.403 | 0.4178 |
| NYC | 640 | 0.2331 | 0.5095 | 0.6077 | 0.6536 | 0.2331 | 0.3783 | 0.4102 | 0.4224 |
| TKY | 256 | 0.2266 | 0.5389 | 0.6507 | 0.7042 | 0.2266 | 0.3898 | 0.4263 | 0.4405 |
| TKY | 384 | 0.2262 | 0.5426 | 0.6561 | 0.709 | 0.2262 | 0.3916 | 0.4286 | 0.4426 |
| TKY | 512 | 0.2241 | 0.53 | 0.6425 | 0.6999 | 0.2241 | 0.3844 | 0.421 | 0.4362 |
| TKY | 640 | 0.2231 | 0.5383 | 0.6542 | 0.7122 | 0.2231 | 0.3882 | 0.4259 | 0.4413 |

Furthermore, we test the effect of different virtual grid sizes on model performance. Fig. 2 depicts the results of MSSRM on the two datasets with the grid size searched in [80×80, 100×100, 120×120]. All the previous experiments are based on 100×100 virtual grids. With this setting, each grid cell covers about 500×500 m on the NYC dataset and about 400×400 m on the TKY dataset. As shown in Fig. 2, the metrics decrease as the grid size increases. This trend is easy to understand: a larger grid size means a higher resolution of the location mapping, and the resulting increase in the number of candidate locations makes predicting the next location more difficult. Hence, the grid size and the hidden size should be adjusted according to the demands of the actual prediction task to achieve the desired prediction performance.
Fig. 2. The influence of different grid sizes w.r.t. Recall@K and NDCG@K: (a) NYC dataset and (b) TKY dataset.
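The effect of grid resolution on the number of candidate locations can be sketched as follows. This is an illustrative mapping from coordinates to virtual grid cells under stated assumptions: the bounding box `BBOX` is a hypothetical rectangle, not the one used in the experiments, and the row-major cell indexing and the function name `to_cell` are choices made for the example.

```python
from typing import Tuple

# Hypothetical bounding box: (min_lat, min_lon, max_lat, max_lon).
BBOX = (40.5, -74.3, 41.0, -73.7)


def to_cell(lat: float, lon: float, bbox: Tuple[float, float, float, float], n: int = 100) -> int:
    """Map a coordinate to the index of its cell in an n x n virtual grid."""
    min_lat, min_lon, max_lat, max_lon = bbox
    row = int((lat - min_lat) / (max_lat - min_lat) * n)
    col = int((lon - min_lon) / (max_lon - min_lon) * n)
    # Clamp so that points on the upper boundary fall into the last row/column.
    row = min(max(row, 0), n - 1)
    col = min(max(col, 0), n - 1)
    return row * n + col  # row-major cell index


# Number of candidate locations for each grid size considered above.
candidates = {n: n * n for n in (80, 100, 120)}
example_cell = to_cell(40.75, -73.99, BBOX, n=100)
```

Moving from an 80×80 to a 120×120 grid more than doubles the number of candidate cells, so the classifier has to choose among many more outputs, which matches the drop in the metrics reported in Fig. 2.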

Conclusion

In this paper, we propose the MSSRM for human mobility prediction. To extract spatial and temporal contexts, all trajectories are encoded as a graph and the time distribution of visits is obtained, which exposes spatial and temporal patterns so that sequential and periodic patterns can be perceived. Specifically, we adopt Node2Vec and Time2Vec before the recurrent neural network. The MSSRM has the advantage of partly utilizing the context information of all locations and learning the internal relationships among them through a self-attention mechanism. The experimental results on two real-world datasets demonstrate that the proposed method outperforms several state-of-the-art models in Recall and NDCG scores. In the future, we plan to generate high-quality personalized user embeddings with more contextual information to further improve the performance of human mobility prediction.

Author’s Contributions

Conceptualization, SW. Investigation and methodology, SW, XZ, RC. Resources, XZ. Supervision, XZ, YL. Writing of the original draft, SW, XZ. Writing of the review and editing, SW, XZ. Validation, SW, BL. Data Curation, RC, BL, YL.

Funding

This work was supported in part by the Natural Science Foundation of Chongqing, China (No. cstc2019jscx-mbdxX0021 and cstc2014kjrc-qnrc40002), and in part by the Major Industrial Technology Research and Development Project of Chongqing High-tech Industry (No. D2018-82).

Competing Interests

The authors declare that they have no competing interests.

Author Biography

Name: Shunjie Wen
Affiliation: Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Biography:
Shunjie Wen received the B.S. degree in Communication Engineering from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2018. He is currently pursuing the M.S. degree in the Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China. His research interests include human-centric computing, smart cities based on multi-modal data, and trajectory data mining.

Name: Xu Zhang
Affiliation: Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Biography:
Xu Zhang received his Ph.D. in Computer and Information Engineering at Inha University, Incheon, Korea, in 2013. He is currently an associate professor at Chongqing University of Posts and Telecommunications. His research interests include urban computing, intelligent transportation system, and few-shot learning.

Name: Ruixu Cao
Affiliation: Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Biography:
Ruixu Cao is currently pursuing the M.S. degree in the Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China. His research interests include traffic data mining and intelligent transportation systems.

Name: Boming Li
Affiliation: Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Biography:
Boming Li received the M.S. degree in Computer Sciences from Chongqing University of Posts and Telecommunications, Chongqing, China in 2020. His research interests include intelligent transportation systems and trajectory data mining.

Name: Yan Li
Affiliation: Department of Electrical and Computer Engineering, Inha University
Biography:
Yan Li received the M.S. and Ph.D. degrees in the Department of Computer Science and Engineering from Inha University, Incheon, Korea, in 2008 and 2016, respectively. She is now an assistant professor in the Department of Electrical and Computer Engineering at Inha University. Her research interests include cloud computing, crowdsensing, and big data analytics.

References

[1] M. Schlapfer, L. Dong, K. O’Keeffe, P. Santi, M. Szell, H. Salat, S. Anklesaria, M. Vazifeh, C. Ratti, and G. B. West, “The universal visitation law of human mobility,” Nature, vol. 593, no. 7860, pp. 522-527, 2021.
[2] L. Chen, H. N. Liang, F. Lu, K. Papangelis, K. L. Man, and Y. Yue, “Collaborative behavior, performance and engagement with visual analytics tasks using mobile devices,” Human-centric Computing and Information Sciences, vol. 10, article no. 47, 2020. https://doi.org/10.1186/s13673-020-00253-7
[3] Y. Liao, “Hot spot analysis of tourist attractions based on stay point spatial clustering,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 750-759, 2020.
[4] F. Xia, J. Wang, X. Kong, D. Zhang, and Z. Wang, “Ranking station importance with human mobility patterns using subway network datasets,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 7, pp. 2840-2852, 2020.
[5] K. Sun, T. Qian, T. Chen, Y. Liang, Q. V. H. Nguyen, and H. Yin, “Where to go next: modeling long-and short-term user preferences for point-of-interest recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, 2020, pp. 214-221.
[6] S. A. Khowaja, B. N. Yahya, and S. L. Lee, “CAPHAR: context-aware personalized human activity recognition using associative learning in smart environments,” Human-centric Computing and Information Sciences, vol. 10, article no. 35, 2020. https://doi.org/10.1186/s13673-020-00240-y
[7] G. Cicceri, F. De Vita, D. Bruneo, G. Merlino, and A. Puliafito, “A deep learning approach for pressure ulcer prevention using wearable computing,” Human-centric Computing and Information Sciences, vol. 10, article no. 5, 2020. https://doi.org/10.1186/s13673-020-0211-8
[8] D. Kong and F. Wu, “HST-LSTM: a hierarchical spatial-temporal long-short term memory network for location prediction,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 2018, pp. 2341-2347.
[9] X. Li, G. Cong, A. Sun, and Y. Cheng, “Learning travel time distributions with deep generative model,” in Proceedings of the World Wide Web Conference (WWW), San Francisco, CA, 2019, pp. 1017-1027.
[10] J. Feng, Y. Li, Z. Yang, Q. Qiu, and D. Jin, “Predicting human mobility with semantic motivation via multi-task attentional recurrent networks,” IEEE Transactions on Knowledge and Data Engineering, 2020. https://doi.org/10.1109/TKDE.2020.3006048
[11] H. Li, Y. Ge, R. Hong, and H. Zhu, “Point-of-interest recommendations: Learning potential check-ins from friends,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2016, pp. 975-984.
[12] P. Zhao, A. Luo, Y. Liu, F. Zhuang, J. Xu, Z. Li, F. Zhuang, V. S. Sheng, and X. Zhou, “Where to go next: a spatio-temporal gated network for next POI recommendation,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, 2019, pp. 5877-5884.
[13] L. Sun, “POI recommendation method based on multi-source information fusion using deep learning in location-based social networks,” Journal of Information Processing Systems, vol. 17, no. 2, pp. 352-368, 2021.
[14] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, and T. Li, “Predicting citywide crowd flows using deep spatio-temporal residual networks,” Artificial Intelligence, vol. 259, pp. 147-166, 2018.
[15] X. Zhang, R. Cao, Z. Zhang, and Y. Xia, “Crowd flow forecasting with multi-graph neural networks,” in Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, United Kingdom, 2020, pp. 1-7.
[16] C. Miao, J. Fu, J. Wang, H. Yu, B. Yao, A. Zhong, J. Chen, and Z. He, “Predicting crowd flows via pyramid dilated deeper spatial-temporal network,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, Israel, 2021, pp. 806-814.
[17] R. Herberth, L. Menz, S. Korper, C. Luo, F. Gauterin, A. Gerlicher, and Q. Wang, “Identifying atypical travel patterns for improved medium-term mobility prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 12, pp. 5010-5021, 2020.
[18] H. Barbosa, M. Barthelemy, G. Ghoshal, C. R. James, M. Lenormand, T. Louail, R. Menezes, J. J. Ramasco, F. Simini, and M. Tomasini, “Human mobility: models and applications,” Physics Reports, vol. 734, pp. 1-74, 2018.
[19] S. Gambs, M. O. Killijian, and M. N. del Prado Cortez, “Next place prediction using mobility Markov chains,” in Proceedings of the 1st Workshop on Measurement, Privacy, And Mobility, Bern, Switzerland, 2012, pp. 1-6.
[20] A. Asahara, K. Maruyama, A. Sato, and K. Seto, “Pedestrian-movement prediction based on mixed Markov-chain model,” in Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Chicago, IL, 2011, pp. 25-33.
[21] Q. Liu, S. Wu, L. Wang, and T. Tan, “Predicting the next location: a recurrent model with spatial and temporal contexts,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, 2016, pp. 194-200.
[22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[23] J. Feng, Y. Li, C. Zhang, F. Sun, F. Meng, A. Guo, and D. Jin, “DeepMove: predicting human mobility with attentional recurrent networks,” in Proceedings of the 2018 World Wide Web Conference, Geneva, Switzerland, 2018, pp. 1459-1468.
[24] X. Zhang, B. Li, C. Song, Z. Huang, and Y. Li, “SASRM: a semantic and attention spatio-temporal recurrent model for next location prediction,” in Proceedings of 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 2020, pp. 1-8.
[25] M. Chen, Y. Liu, and X. Yu, “NLPMM: a next location predictor with Markov modeling,” in Advances in Knowledge Discovery and Data Mining. Cham, Switzerland: Springer, 2014, pp. 186-197.
[26] H. Liu, G. Wu, and G. Wang, “Tell me where to go and what to do next, but do not bother me,” in Proceedings of the 8th ACM Conference on Recommender System, Silicon Valley, CA, 2014, pp. 375-376.
[27] S. Feng, X. Li, Y. Zeng, G. Cong, Y. M. Chee, and Q. Yuan, “Personalized ranking metric embedding for next new poi recommendation,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 2015, pp. 2069-2075.
[28] X. Li, G. Cong, X. L. Li, T. A. N. Pham, and S. Krishnaswamy, “Rank-GeoFM: a ranking based geographical factorization method for point of interest recommendation,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 2015, pp. 433-442.
[29] Y. Zhu, H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, and D. Cai, “What to do next: modeling user behaviors by time-LSTM,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 2017, pp. 3602-3608.
[30] J. Manotumruksa, C. Macdonald, and I. Ounis, “A contextual attention recurrent architecture for context-aware venue recommendation,” in Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, 2018, pp. 555-564.
[31] L. Huang, Y. Ma, S. Wang, and Y. Liu, “An attention-based spatiotemporal LSTM network for next POI recommendation,” IEEE Transactions on Services Computing, 2019. https://doi.org/10.1109/TSC.2019.2918310
[32] A. Grover and J. Leskovec, “node2vec: scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2016, pp. 855-864.
[33] S. M. Kazemi, R. Goel, S. Eghbali, J. Ramanan, J. Sahota, S. Thakur, S. Wu, C. Smyth, P. Poupart, and M. Brubaker, “Time2vec: learning a vector representation of time,” 2019 [Online]. Available: https://arxiv.org/abs/1907.05321.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 5998-6008, 2017.
[35] D. Yang, D. Zhang, V. W. Zheng, and Z. Yu, “Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 1, pp. 129-142, 2015.
[36] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” 2017 [Online]. Available: http://arxiv.org/abs/1412.6980.
[37] T. Chen, H. Yin, Q. V. H. Nguyen, W. C. Peng, X. Li, and X. Zhou, “Sequence-aware factorization machines for temporal predictive analytics,” in Proceedings of 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, 2020, pp. 1405-1416.
