Multi-Scale Feature-Based Spatiotemporal Pyramid Network for Hand Gesture Recognition
• Zongjing Cao, Yan Li, and Byeong-Seok Shin*

Human-centric Computing and Information Sciences volume 12, Article number: 46 (2022)
https://doi.org/10.22967/HCIS.2022.12.046

Abstract

Effectively capturing the spatiotemporal features of hand gestures from sequence data is crucial for gesture recognition. Existing work obtains motion features between neighboring frames through well-designed temporal modeling networks; however, less attention has been paid to the spatial information contained in each frame. These approaches ignore the implicit complementary advantages of multi-scale appearance representations, which are essential to gesture recognition. We propose a multi-scale, feature-based spatiotemporal pyramid network for hand gesture recognition. It has a top-down, lateral-connection architecture designed to fuse spatial and temporal features from multiple scales in each layer. The network first outputs a coarse feature in a feedforward pass and then refines this feature in the top-down pass using features from successive lower layers. Similar to skip connections, our approach uses features from each layer of the network, but does not attempt to output independent predictions in each layer. Furthermore, we introduce a spatiotemporal pyramid module formed by stacking multiple successive refinement modules to fuse the multi-scale spatial features output from each layer. We evaluate the proposed model on two publicly available benchmark hand gesture datasets. The model achieved accuracies of 85.1% and 95.4% for the depth modality on the NVGesture and EgoGesture datasets, respectively. The comparison results show that the proposed hand gesture recognition method outperforms existing state-of-the-art methods.

Keywords

Deep Learning, Hand Gesture Recognition, Pyramid Network, Spatiotemporal Feature

Introduction

The hand gesture is a form of non-vocal and non-verbal communication that uses visible hand actions to communicate a specific message. Compared to other forms, gestures can make communication more convenient, intuitive, and natural, especially in noisy environments. Hand gesture recognition based on computer vision technology refers to the entire process from tracking a hand gesture to representing it and converting it into semantically meaningful commands [16]. Gesture recognition can be viewed as a way for computers to understand human body language and is an important part of human-computer interaction. Hand gesture recognition has become a research hotspot in the field of computer vision due to its wide application in human-computer interaction, such as in virtual reality games, robot control, and smart home systems.
Human actions are usually scene-related; for example, actions such as playing basketball, horseback riding, or playing the piano usually occur in a specific scene. As pointed out in [7], most of these action categories can be recognized from background scene information, while the temporal information of the action is not important. Unlike human action recognition, gesture recognition depends more heavily on temporal information. For example, zooming in and zooming out of the screen with two fingers have similar features in the spatial domain, but their temporal features are reversed. In addition, for gesture recognition, background scene information is almost unusable or even acts as noise. Therefore, it is essential for hand gesture recognition tasks to effectively extract motion features from sequence data. Current work mainly uses optical flow or 3D convolutional neural networks (ConvNets) to capture the dynamic motion features of the action [8, 9]. Although these methods effectively obtain motion information between neighboring frames through well-designed temporal modeling networks, they ignore crucial spatial information contained in a single frame. As shown in Fig. 1, we observed that in natural scenes, the same gesture may appear at multiple scales due to differences in user location and operating habits. Moreover, the essential contextual information of a gesture may occupy a considerably larger area than the gesture itself. If a gesture recognition system cannot effectively focus on the multi-scale spatial information of gestures, even a carefully designed temporal modeling network may fail to recognize gestures accurately because it picks up too much noise. Therefore, perceiving gestures at different scales, or building multi-scale representations of gestures, can further improve model performance. Motivated by these observations, we propose a multi-scale feature-based pyramid network for hand gesture recognition.
The proposed network fuses spatiotemporal features from each layer of multiple sizes through a top-down, horizontally connected architecture.

Fig. 1. Examples of sampling the same hand gesture from multiple videos.

Recognizing objects at different scales is a fundamental challenge in computer vision tasks. Feature pyramids built on image pyramids are one useful way to address this challenge [10, 11]. These featurized image pyramids are scale-invariant because a change in the scale of an object can be offset by shifting its level in the pyramid. However, because the features are computed independently at each image scale, this approach is computationally slow. In deep ConvNets, the lower layers capture rich spatial information, while the upper layers encode object-level knowledge but are invariant to factors such as pose and appearance [10–12]. The top layer has high-level visual perception but ignores important local information, whereas the bottom layer has higher-resolution features and preserves essential image details, but lacks semantics. That is, the features at different depths in a single ConvNet model already capture rich spatial information. However, existing gesture recognition methods are based on the idea of image classification and use only the top-level appearance features of sequence data. These methods ignore the implicit complementary advantages of multi-scale appearance representations, which are essential for gesture recognition. We propose a spatiotemporal pyramid network that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections. The network first outputs a rough feature in the feedforward pass and then gradually refines this feature using features from successive lower layers in the top-down pass. Similar to skip connections, the proposed method uses features from each ConvNet layer, but does not attempt to output independent predictions at each layer, thus reducing the computational cost of the model.
The main contributions of our work can be summarized as follows.
(1) We propose a novel top-down architecture with lateral connections to fuse spatial and temporal pyramid features at all scales in sequence data. The proposed approach utilizes the inherent multi-scale, pyramidal hierarchy of ConvNets to build feature pyramids.
(2) We introduce a refinement module to merge multi-scale features from each layer. Each refinement module is responsible for matching the features generated by the top-down pass with those generated by the bottom-up pass to generate a new feature with twice the spatial resolution.
(3) We evaluate several effective data-augmentation methods and learning-rate-setting strategies to overcome the problem of model overfitting caused by a limited number of training samples, thus improving the model’s performance.
The remainder of this paper is organized as follows. In Section 2, we present some related work and background on gesture recognition techniques. In Section 3, we describe our proposed approach in detail. The results of the comparison with other state-of-the-art methods are reported in Section 4. Section 5 contains the conclusion and some suggestions for further research.

Related Work

One of the main steps of a hand gesture recognition system is to capture the spatiotemporal information of gestures. In traditional action recognition methods, hand-crafted features are usually extracted first, and then classified using a classifier. In the past few decades, several works have focused on designing appropriate features, such as histograms of oriented gradients or histograms of optical flow. In [13], the authors proposed the improved dense trajectory features, which are currently the best hand-crafted feature for action recognition tasks. Despite its good performance, it does not apply to large-scale datasets due to its high computational resource consumption.
Since AlexNet [14], ConvNet-based approaches have achieved state-of-the-art results in many vision tasks, such as image recognition, segmentation, and detection. Following this trend, researchers have tried to use ConvNets for action recognition tasks in the video domain. In [15], the authors used stacked video frames as input to train a ConvNet for video classification. This was the first approach to use deep learning for video classification. However, since only appearance features are used, its performance is worse than that of the best hand-crafted shallow representations, such as improved dense trajectory features [13]. To address this issue, Simonyan and Zisserman [8] proposed a two-stream ConvNet model for action recognition, which uses two ConvNets to process the visual information contained in each frame and the motion between neighboring frames separately. However, two-stream networks perform only short-term modeling: the action is determined from a single frame and the optical flow around it, which makes it impossible to handle videos with long-range temporal structures. To learn the long-range temporal structure of actions, Wang et al. [16] proposed a temporal segment network (TSN) for action recognition. The main idea of TSN is to divide the entire video evenly into several segments and process each segment separately. However, since both two-stream networks [8] and TSN [16] require pre-calculated optical flow information of the action, they are not suitable for real-time recognition.
Compared with 2D ConvNets, 3D ConvNets are more suitable for spatiotemporal feature learning. Tran et al. [17] proposed a method to extract spatiotemporal features of actions using 3D ConvNets. Due to the large number of parameters in 3D ConvNets and the lack of an effective pre-trained model, the performance of 3D ConvNets is worse than that of two-stream networks. 2D ConvNets have low computing costs but cannot capture the temporal information of an action. 3D ConvNets can provide good performance but cannot perform real-time detection due to the large amount of computational resources required. Because of the sharp increase in parameters and computation of 3D ConvNets compared to 2D ConvNets, several recent works have proposed 3D ConvNet models based on decomposed convolution kernels, such as I3D [18], S3D [19], and R(2+1)D [20]. Lin et al. [21] proposed a temporal shift module (TSM) that shifts part of the channels along the temporal dimension to exchange information between neighboring frames. The TSM can be inserted into the backbone of a 2D ConvNet for joint spatiotemporal modeling. It can achieve the performance of a 3D ConvNet while maintaining the complexity of a 2D ConvNet. In this work, we use TSM as the backbone residual module to learn the temporal features of gestures in sequence data.

Fig. 2. Architecture of the proposed spatiotemporal pyramid network.

Temporal shift module: Previous works [9, 22] captured dynamic motion information of gestures using two-stream networks or 3D ConvNets. Two-stream networks show strong performance in gesture recognition tasks, but processing requires costly multi-branch networks, consuming a large number of computational resources. Two-stream networks using optical flow take a long time to pre-generate the corresponding optical flow features. Approaches based on 3D ConvNets can achieve good performance but suffer from overfitting and slow convergence due to their computationally intensive processes and dataset limitations. A temporal shift module was used to learn spatiotemporal features of gestures in videos.
For a 1D input vector X, the convolution operation can be expressed as Y = Convolution(W, X). Assuming that the parameters W of the convolution are $(w_1, w_2, w_3)$, the output can be re-expressed as $Y_i = w_1 X_{i-1} + w_2 X_i + w_3 X_{i+1}$. We can decouple the convolution operation into two parts: shift and multiply-accumulate. The shift operation can be expressed as:

$X^{-1}_i = X_{i-1}, \quad X^{0}_i = X_i, \quad X^{+1}_i = X_{i+1}$ (1)

and the multiply-accumulate operation can be calculated using the following equation:

$Y = w_1 X^{-1} + w_2 X^{0} + w_3 X^{+1}$ (2)
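The decoupling of a 1D convolution into a shift step and a multiply-accumulate step can be checked numerically. The following is a minimal sketch under stated assumptions: zero padding at the sequence boundaries and a kernel of size three. The function names (`shift`, `conv1d_direct`, `conv1d_shift_mac`) are illustrative and do not come from the paper.

```python
def shift(x, offset):
    """Shift a 1D sequence by `offset` positions, zero-padding the boundary.

    offset=+1 places X_{i-1} at position i; offset=-1 places X_{i+1} there.
    """
    n = len(x)
    return [x[i - offset] if 0 <= i - offset < n else 0.0 for i in range(n)]


def conv1d_direct(x, w):
    """Reference form: Y_i = w1*X_{i-1} + w2*X_i + w3*X_{i+1} (zero padding assumed)."""
    w1, w2, w3 = w
    padded = [0.0] + list(x) + [0.0]
    return [w1 * padded[i] + w2 * padded[i + 1] + w3 * padded[i + 2]
            for i in range(len(x))]


def conv1d_shift_mac(x, w):
    """Same result computed as a shift step followed by multiply-accumulate."""
    w1, w2, w3 = w
    x_prev = shift(x, +1)   # X^{-1}
    x_next = shift(x, -1)   # X^{+1}
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(x_prev, x, x_next)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = (0.5, 1.0, -1.0)
print(conv1d_direct(signal, kernel) == conv1d_shift_mac(signal, kernel))
```

Because the shift step only moves data, all of the arithmetic sits in the multiply-accumulate step.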

Fig. 3 shows the internal structure of the TSM residual module. The TSM shifts part of the channels along the temporal dimension to exchange information with neighboring frames. As summarized in Table 1, the TSM was inserted into each residual block of the backbone, that is, from layer 1 to layer 4: three, four, six, and three modules were inserted into layers 1, 2, 3, and 4, respectively. Moreover, a TSN was used to model the dynamics and to use visual information from the entire video to make gesture predictions.
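The channel shift along the temporal dimension can be illustrated on a small T × C feature array. This is a hedged sketch, not the authors' implementation: the proportion of shifted channels (`fold_div=8`, meaning one eighth of the channels is shifted in each direction) and the zero padding at the first and last frames are assumptions, since the text does not state them.

```python
def temporal_shift(frames, fold_div=8):
    """Shift part of the channels along the temporal axis.

    frames: list of T per-frame feature vectors, each a list of C values.
    The first C // fold_div channels take their value from the next frame,
    the following C // fold_div channels from the previous frame, and the
    remaining channels are left in place. Out-of-range frames give zeros.
    """
    num_frames, num_channels = len(frames), len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for ch in range(num_channels):
            if ch < fold:
                src = t + 1          # information flows from the future frame
            elif ch < 2 * fold:
                src = t - 1          # information flows from the past frame
            else:
                src = t              # unshifted channel
            if 0 <= src < num_frames:
                shifted[t][ch] = frames[src][ch]
    return shifted


# Three frames with eight channels each; value 10*t + ch identifies the source.
clip = [[float(10 * t + ch) for ch in range(8)] for t in range(3)]
for row in temporal_shift(clip):
    print(row)
```

After the shift, an ordinary 2D convolution applied per frame mixes features from neighboring frames without any extra parameters.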

Fig. 3. Temporal shift residual module.

Table 1. Structure of the backbone network
| Stage | Layer | Output size |
|---|---|---|
| Input | - | T × 224 × 224 |
| Conv 1 | 1 × 1 × 1, 64, stride 1, 2, 2 | T × 112 × 112 |
| Pool 1 | 1 × 3 × 3, max, stride 1, 2, 2 | T × 56 × 56 |
| Layer 1 | | T × 56 × 56 |
| Layer 2 | | T × 28 × 28 |
| Layer 3 | | T × 14 × 14 |
| Layer 4 | | T × 7 × 7 |
| Global average pool, FC | - | T × N classes |
| Temporal average | - | N classes |

Spatial pyramid module: Representing features at multiple scales is important for hand gesture recognition tasks. The goal of the spatial pyramid module is to extract spatiotemporal pyramid features for hand gesture recognition in sequence data. We used ResNet-50 as the backbone network to extract representations at different levels, considering its balance between accuracy and complexity. As shown in Fig. 2, we chose the outputs of layers 1, 2, 3, and 4 as multi-scale spatial feature maps. The spatial pyramid module exploits the inherent hierarchy of the spatial ResNet-50. Unlike approaches that predict at every level, our method does not attempt to output independent predictions at each layer. Instead, the module first outputs a coarse feature in a feedforward pass and then refines the feature encoding in the top-down pass using features from successive lower layers. Finally, it outputs a feature pyramid containing rich spatial features at different scales for gesture prediction.

Refinement module: To fuse multi-scale features from different layers, we introduce the refinement module and stack successive modules to form the spatiotemporal pyramid module. As shown in Fig. 2, the feedforward pathway of the network outputs a low-resolution, but semantically meaningful, feature map $M^1$ with $k_{m^1}$ channels. $M^1$ serves as input for the top-down pyramid module. Each refinement module, $R^i$, aggregates the information from the top-down feature $M^i$ and the features $F^i$ from the corresponding layer of the bottom-up computation. By construction, the two features ($M^i$ and $F^i$) have the same spatial dimensions. The goal of the refinement module is to generate a new feature, $M^{i+1}$, with double the spatial resolution based on inputs $M^i$ and $F^i$.
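The data flow of stacked refinement modules can be sketched on single-channel 2D grids. This is only an illustration under loud assumptions: element-wise addition stands in for the learned fusion, nearest-neighbor interpolation stands in for the upsampling operator, and the names `refine`, `upsample2x`, and `top_down` are hypothetical. The text does not specify the layers inside each module.

```python
def upsample2x(feature):
    """Nearest-neighbor 2x upsampling of a 2D grid given as a list of rows."""
    out = []
    for row in feature:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def refine(top_down_feature, lateral_feature):
    """One hypothetical refinement step: merge two same-sized maps, then upsample.

    Element-wise addition is an assumed placeholder for the learned merge.
    """
    same_shape = len(top_down_feature) == len(lateral_feature) and all(
        len(a) == len(b) for a, b in zip(top_down_feature, lateral_feature))
    if not same_shape:
        raise ValueError("both inputs must share spatial dimensions")
    merged = [[a + b for a, b in zip(row_m, row_f)]
              for row_m, row_f in zip(top_down_feature, lateral_feature)]
    return upsample2x(merged)


def top_down(coarse_feature, lateral_features):
    """Apply refinement steps in sequence, doubling the resolution each time."""
    current = coarse_feature
    for lateral in lateral_features:
        current = refine(current, lateral)
    return current


coarse = [[1.0, 2.0], [3.0, 4.0]]                 # stands in for the coarsest map
laterals = [[[1.0, 1.0], [1.0, 1.0]],             # 2 x 2 lateral feature
            [[0.0] * 4 for _ in range(4)]]        # 4 x 4 lateral feature
result = top_down(coarse, laterals)
print(len(result), len(result[0]))
```

Each step doubles the spatial resolution, so two steps take the 2 × 2 input to 8 × 8, in the same way that the backbone sizes in Table 1 double from 7 × 7 up to 56 × 56.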

Loss Function
An input video, V, can be divided into m segments of equal duration (S1, S2, …, Sm). Then, a sequence of snippets can be modeled by the TSN, which can be expressed as follows:

$\mathrm{TSN}(X_1, X_2, \ldots, X_m) = H\big(g(f(X_1; W), f(X_2; W), \ldots, f(X_m; W))\big)$ (3)

where X1, X2, …, Xm are short snippets randomly sampled from their corresponding segments, S1, S2, …, Sm, and f(Xm; W) is a function of the ConvNets with parameter W. Function f operates on the short snippet, Xm, and outputs the predicted scores for that snippet corresponding to all classes. Segmental consensus function g combines the outputs obtained from all short snippets, and finally outputs the consensus of the class hypothesis from among all segments. H is the prediction function used to predict the probability that the entire video belongs to each action class. In this work, we use the softmax function as prediction function H, which is a widely used scoring function for computing the probability distribution of output categories. Combined with the standard categorical cross-entropy loss, the final loss function, L, of the model can be represented as

$L(y, G) = -\sum_{i=1}^{N} y_i \left( G_i - \log \sum_{j=1}^{N} \exp G_j \right)$ (4)

where G = g(f($X_1$; W), f($X_2$; W), …, f($X_m$; W)) represents the segmental consensus, $y_i$ denotes the target for class i, and N indicates the number of gesture classes. For simplicity, we used the simplest form of g, which can be expressed as $G_i$ = g($f_i$($X_1$), $f_i$($X_2$), …, $f_i$($X_m$)). Here, $G_i$ is computed from the scores of the same class in all segments using segmental consensus function g. For segmental consensus function g, three commonly used calculation methods are taking the maximum, weighted averaging, and even averaging. We use the weighted averaging method.
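The consensus and loss computation can be written out as follows. This is a minimal sketch under stated assumptions: snippet scores are plain Python lists, the target is one-hot, and the consensus weights default to an even average because the text does not give the weights used for weighted averaging. All function names are illustrative.

```python
import math


def segmental_consensus(snippet_scores, weights=None):
    """Combine per-snippet class scores into one consensus score per class.

    snippet_scores: list of m score vectors, each of length N (classes).
    weights: optional list of m weights (assumed to sum to 1); even
    averaging is used when omitted.
    """
    m = len(snippet_scores)
    if weights is None:
        weights = [1.0 / m] * m
    num_classes = len(snippet_scores[0])
    return [sum(w * scores[i] for w, scores in zip(weights, snippet_scores))
            for i in range(num_classes)]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def consensus_loss(consensus, target):
    """Cross-entropy on the consensus: -sum_i y_i * (G_i - log sum_j exp G_j)."""
    lse = log_sum_exp(consensus)
    return -sum(y * (g - lse) for y, g in zip(target, consensus))


scores = [[2.0, 0.0], [0.0, 0.0]]        # two snippets, two classes (made-up values)
consensus = segmental_consensus(scores)   # even average of the two snippets
print(consensus, consensus_loss(consensus, [1.0, 0.0]))
```

Subtracting the maximum inside `log_sum_exp` leaves the result unchanged mathematically but avoids overflow for large scores.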

Fig. 4. Sample frames from the datasets: (a) NVGesture and (b) EgoGesture datasets.

Table 2. Details of datasets used in our experiments
| Dataset | Classes | Training | Validation | Testing | Total |
|---|---|---|---|---|---|
| NVGesture | 25 | 750 | 300 | 482 | 1,532 |
| EgoGesture | 83 | 14,416 | 4,768 | 4,977 | 24,161 |

Experimental Setting
The proposed network was implemented using the PyTorch deep learning framework. The model parameters were initialized using weights pre-trained on the Kinetics400 dataset. In the training phase, a frame was randomly cropped from the four corners or the center and was then resized to 240 × 320 pixels to form the input frame. Then, the input frames were randomly sampled using multi-scale cropping at scale ratios of 1.0, 0.875, and 0.75. Finally, these cropped frames were resized to 224 × 224 pixels, normalized based on mean and standard deviation, and input to the network for training. In the inference phase, only center cropping, resizing, and normalization were performed on the input frames. The model was trained and validated on a server equipped with two NVIDIA GeForce RTX 3090 GPUs. Hand gesture recognition experiments were conducted following the sampling strategy described in [16]. An input video was first divided into S equal-length segments. Then, a frame was randomly selected from each segment to obtain a clip with S frames. Each frame was preprocessed using the method described above, so the final input size of the model was N × S × 224 × 224, where N indicates the batch size. The number of segments S was set to 8, 16, and 32.
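The segment-based sampling step can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `sample_segment_indices` is hypothetical, and when the frame count is not divisible by S the segments are only approximately equal in length.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=random):
    """Pick one random frame index from each of `num_segments` segments.

    The video is split into contiguous segments of (near-)equal length and
    one index is drawn uniformly from each, giving a clip of S frames.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    indices = []
    for s in range(num_segments):
        start = s * num_frames // num_segments        # inclusive
        end = (s + 1) * num_frames // num_segments    # exclusive
        indices.append(rng.randrange(start, end))
    return indices


# An 80-frame video (made-up length) split into S = 8 segments of 10 frames.
print(sample_segment_indices(80, 8, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the draw reproducible, which is convenient for debugging a data pipeline.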
Mini-batch stochastic gradient descent was adopted as the optimizer, with a momentum of 0.9 and a weight decay of 0.005. The model was trained for 60 epochs, and the mini-batch size N was set to 32, 16, and 8 when the number of segments S was 8, 16, and 32, respectively. The initial learning rate was 0.001. To ensure healthy convergence at the beginning of training, the first 10 epochs were used for warm-up, linearly increasing the learning rate from 0.0001 to 0.001. Over the remaining epochs, the learning rate decayed from the initial value to 0 following a cosine function. A detailed description of the learning-rate-setting strategy is presented in Section 4.3. In the inference phase, following the setting in [21], 16 clips were randomly sampled from the entire video. The final gesture prediction was obtained by averaging the scores of all the segments.

Training Refinements
Setting the learning rate is extremely important in training neural networks. To further improve the accuracy of the model, two learning-rate adjustment strategies were used in this work. At the beginning of training for a hand gesture recognition model, the weight parameters of all neural networks are typically random values, which is far from an optimal global solution. To avoid a sudden increase in the learning rate, and to ensure healthy convergence at the beginning of training, Goyal et al. [25] proposed a gradual warm-up strategy that increases the learning rate linearly. For example, assuming that the initial learning rate is l and the first n batches are used for warm-up, the learning rate for the i-th batch (li) is set to li = il/n, 1 ≤ i ≤ n. Ten epochs were used to increase the learning rate linearly from 0.0001 to 0.001.
The adjustment of the learning rate during the training of a deep neural network is also critical. A widely used adjustment strategy is to decay the learning rate exponentially. In [26], the authors multiplied the learning rate by 0.1 every 30 epochs. Here, this approach is referred to as step decay, whereas the method proposed by Loshchilov and Hutter [27] is referred to as cosine decay. The key idea of cosine decay is to follow the cosine function so that the learning rate gradually decreases from the initial value to zero. Mathematically, assuming that the total number of batches is n and the initial learning rate is l, the learning rate $l_i$ for batch i can be calculated using the following equation:

$l_i = \frac{1}{2}\left(1 + \cos\left(\frac{i\pi}{n}\right)\right) l$ (5)

It took 50 epochs for the learning rate to decay from the initial value of 0.001 to 0. The learning rate curves with warm-up and cosine decays are shown in Fig. 5. The curves comparing step and cosine decays are also shown in Fig. 5, indicated by the solid red and dashed pink lines, respectively.
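The combined schedule can be sketched in a few lines. This is a minimal sketch under stated assumptions: the step unit is the epoch (the text defines the formulas per batch), steps are 1-indexed, and the function name and parameter names are illustrative.

```python
import math


def learning_rate(step, base_lr=0.001, warmup_steps=10, total_steps=60):
    """Linear warm-up followed by cosine decay to zero.

    Warm-up uses l_i = i * l / n for 1 <= i <= n, so with l = 0.001 and
    n = 10 the first step starts at 0.0001. The remaining steps follow
    0.5 * (1 + cos(pi * i / n)) * l, with i counted from the end of warm-up.
    """
    if step <= warmup_steps:
        return step * base_lr / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr


for epoch in (1, 10, 35, 60):
    print(epoch, learning_rate(epoch))
```

With these defaults the rate rises from 0.0001 to 0.001 over the first 10 epochs, reaches half of the initial value midway through the 50 decay epochs, and ends at zero.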

Fig. 5. Visualization of learning rate curves for warm-up and cosine decay.

Results and Analysis

Learning curves: A learning curve is a plot of a model's learning performance over time. Learning curves are an effective tool for evaluating the learning and generalization behavior of deep learning models. The training curve indicates how well the model is learning, while the validation curve indicates how well the model is generalizing. Fig. 6 shows the accuracy and loss curves of our model for the training and validation phases on the two datasets, where the x-axis is time, the primary y-axis represents accuracy, and the secondary y-axis represents loss. As can be seen from Fig. 6, the validation accuracy increases to a stable point and has a small gap with the training accuracy. This shows that our model fits both datasets well.

Fig. 6. Accuracy and loss curves for the model when using two datasets: (a) NVGesture and (b) EgoGesture dataset

Table 3. Accuracy (%) comparisons between the proposed method and other state-of-the-art methods
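| Method | Backbone network | Pretraining dataset | Frames per video | NVGesture Color | NVGesture Depth | EgoGesture Color | EgoGesture Depth |
|---|---|---|---|---|---|---|---|
| TSN | ResNet-50 | Kinetics400 | 8 | N/A | N/A | 93.1 | N/A |
| TSN | ResNet-50 | Kinetics400 | 16 | N/A | N/A | N/A | N/A |
| TSM | ResNet-50 | Kinetics400 | 8 | N/A | N/A | 92.1 | N/A |
| TSM | ResNet-50 | Kinetics400 | 16 | N/A | N/A | N/A | N/A |
| ACTION | ResNet-50 | ImageNet | 8 | N/A | N/A | 94.2 | N/A |
| ACTION | ResNet-50 | ImageNet | 16 | N/A | N/A | 94.4 | N/A |
| I3D | Inception-v1 | Kinetics400 | - | 78.3 | 82.2 | 90.3 | 89.4 |
| C3D | C3D-8 | Sports-1M | 16 | 69.3 | 78.8 | 86.4 | 88.1 |
| 3DCNN | C3D-8 | Sports-1M | 8 | 74.1 | 80.3 | 78.4 | 82.2 |
| MTUT | Inception-v1 | Kinetics400 | - | 81.3 | 84.8 | 92.4 | 91.9 |
| ResNeXt | ResNeXt-101 | Jester | 8 | N/A | N/A | N/A | N/A |
| ResNeXt | ResNeXt-101 | Jester | 16 | 66.4 | 72.8 | 90.9 | 91.8 |
| ResNeXt | ResNeXt-101 | Jester | 24 | 72.4 | 79.2 | 92.8 | 93.4 |
| ResNeXt | ResNeXt-101 | Jester | 32 | 78.6 | 83.8 | 93.7 | 94 |
| Proposed method | ResNet-50 | Kinetics400 | 8 | 82.9 | 84.2 | 92.3 | 93.5 |
| Proposed method | ResNet-50 | Kinetics400 | 16 | 83.7 | 84.9 | 94.5 | 95.4 |
| Proposed method | ResNet-50 | Kinetics400 | 24 | 83.3 | 85.1 | 94.6 | 94.8 |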
The bold font indicates the best performance obtained by our network

Comparison with the existing methods: The division scheme provided by the compilers of both datasets was followed, and the original datasets were split into the same three subsets: training, validation, and testing. Mean classification accuracy was the metric used to evaluate performance. To compare the proposed model with existing approaches, standard evaluation protocols were used, and the highest accuracy with EgoGesture and NVGesture was calculated. The performance of the proposed spatiotemporal pyramid network with the two datasets was compared to eight state-of-the-art methods for hand gesture recognition: TSN, TSM, ACTION [28], I3D [18], C3D [17], 3DCNN [9], MTUT [29], and ResNeXt [22]. The proposed methods were evaluated separately using color and depth modalities. The comparison results are summarized in Table 3. The proposed model achieved recognition accuracies of 94.5% and 95.4% for color and depth, respectively, with EgoGesture. With NVGesture, recognition accuracies reached 83.7% and 84.9% for color and depth, respectively. The comparison results showed that our proposed network outperformed all other methods. The model using depth modality achieved higher recognition accuracy than when using color owing to the elimination of background noise. Moreover, the results showed that splitting a video into 16 and 24 clips achieved almost consistent performance with the proposed method.

F1-score: Precision, recall, and F1-score are common metrics used to evaluate the performance of gesture recognition models [30, 31]. The precision and recall can be defined using the following equations:

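$\mathrm{Precision} = \frac{TP}{TP + FP}$ (6)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (7)

where TP indicates the number of true positives, FP indicates the number of false positives, and FN indicates the number of false negatives. The F1-score can be interpreted as the harmonic mean of the precision and recall, which can be defined as follows:

$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (8)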

where the F1-score reaches its best value at 1 and worst score at 0. Table 4 shows the F1-score of our proposed spatiotemporal pyramid network on two datasets. Our model achieves an F1-score of 0.83 and 0.94 on NVGesture and EgoGesture datasets, respectively.

Table 4. Precision, recall, and F1-score of our proposed model on two datasets
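| Dataset | Precision (%) Color | Precision (%) Depth | Recall (%) Color | Recall (%) Depth | F1-score (%) Color | F1-score (%) Depth |
|---|---|---|---|---|---|---|
| NVGesture | 84.22 | 86.1 | 83.83 | 85.52 | 83.68 | 85.5 |
| EgoGesture | 94.62 | 95.46 | 94.53 | 95.41 | 94.53 | 95.49 |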
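The three metrics can be computed from raw counts as follows. This is a small sketch with made-up counts; the function name is illustrative, and returning 0.0 when a denominator is zero is a convention assumed here rather than something the text specifies. How the per-class values behind Table 4 were averaged is not stated, so this sketch is not expected to reproduce those numbers.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one gesture class.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```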

Visualization of two gestures: To gain further insight into what our proposed network learns, we visualize its feature extraction capability using class activation mapping. Class activation mapping shows what a ConvNet focuses on by overlaying heat maps on the original frames. The visualization results of the two gestures are shown in Fig. 7. The first row is the input video sequence, and the second row is the class activation mapping obtained by our proposed model. It can be seen that the proposed approach pays close attention to the regions where gestures appear in different video sequences. The visualization results further illustrate the ability of our network to extract gesture-related features.

Fig. 7. Visualization of two actions: (a) moving a hand toward the left and (b) zooming out with the fingers

Conclusion

In this work, we developed a real-time hand gesture recognition system based on spatiotemporal pyramid networks. We used a feature-level spatial pyramid module to aggregate multi-scale appearance features, and a TSM to model temporal information. Moreover, we introduce a spatiotemporal pyramid module formed by stacking multiple successive refinement modules to fuse the multi-scale spatial feature from different layers. Similar to skip connections, our approach uses features from each layer of the network, but does not attempt to output independent predictions in each layer. We evaluated the proposed method with two benchmark datasets. The experimental results indicate that the proposed hand gesture detection and recognition method outperformed existing approaches.

Author’s Contributions

Conceptualization: ZC. Investigation and methodology: ZC, YL. Project administration: ZC, YL, BSS. Supervision: BSS. Writing the original draft: ZC. Writing the review, and editing: ZC, YL, BSS. Software: ZC, YL. Validation: ZC, YL. All authors have proofread the final version.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (No. NRF-2022R1A2B5B01001553). This work was also supported by the NRF grant funded by the Korean government (No. NRF-2022R1A4A1033549) and in part by the China Scholarship Council.

Competing Interests

The authors declare that they have no competing interests.

Author Biography

Name : Zongjing Cao
Affiliation : Department of Electrical and Computer Engineering, Inha University.
Biography : He is currently pursuing the Ph.D. degree in computer science and engineering with Inha University, Incheon, South Korea. His current research interests include image processing, computer vision, and deep learning. Contact him at zjcao@inha.edu.

Name : Yan Li
Affiliation : Department of Electrical and Computer Engineering, Inha University.
Biography : Yan Li is an assistant professor in the Department of Electrical and Computer Engineering at Inha University. Her research interests include cloud computing, crowd sensing, and big data analytics. Yan has a doctoral degree from Inha University, Korea. Contact her at leeyeon@inha.ac.kr.

Name : Byeong-Seok Shin
Affiliation : Department of Electrical and Computer Engineering, Inha University.
Biography : Byeong-Seok Shin (Corresponding author) is a professor in the Department of Electrical and Computer Engineering at Inha University Korea. His research interests include Medical Imaging, Volume Visualization, and Real-time Rendering. He is a member of IEEE and ACM. Contact him at bsshin@inha.ac.kr.

References

[1] F. Zhang, T. Y. Wu, J. S. Pan, G. Ding, and Z. Li, “Human motion recognition based on SVM in VR art media interaction environment,” Human-centric Computing and Information Sciences, vol. 9, article no. 40, 2019. https://doi.org/10.1186/s13673-019-0203-8
[2] H. Wu, W. Luo, N. Pan, S. Nan, Y. Deng, S. Fu, and L. Yang, “Understanding freehand gestures: a study of freehand gestural interaction for immersive VR shopping applications,” Human-centric Computing and Information Sciences, vol. 9, article no. 43, 2019. https://doi.org/10.1186/s13673-019-0204-7
[3] R. Jafri and H. R. Arabnia, “A survey of face recognition techniques,” Journal of Information Processing Systems, vol. 5, no. 2, pp. 41-68, 2009.
[4] Y. Yang, L. Li, Z. Liu, and G. Liu, “Abnormal behavior recognition based on spatio-temporal context,” Journal of Information Processing Systems, vol. 16, no. 3, pp. 612-628, 2020.
[5] X. X. Wang and Y. Shen, “A video traffic flow detection system based on machine vision,” Journal of Information Processing Systems, vol. 15, no. 5, pp. 1218-1230, 2019.
[6] T. M. Li, H. C. Chao, and J. Zhang, “Emotion classification based on brain wave: a survey,” Human-centric Computing and Information Sciences, vol. 9, article no. 42, 2019. https://doi.org/10.1186/s13673-019-0201-x
[7] Y. Li, B. Ji, X. Shi, J. Zhang, B. Kang, and L. Wang, “TEA: temporal excitation and aggregation for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 906-915.
[8] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition,” Advances in Neural Information Processing Systems, vol. 27, pp. 568-576, 2015.
[9] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 4207-4215.
[10] T. Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 936-944.
[11] P. O. Pinheiro, T. Y. Lin, R. Collobert, and P. Dollar, “Learning to refine object segments,” in Computer Vision – ECCV 2016. Cham, Switzerland: Springer, 2016, pp. 75-91.
[12] Y. Zhang, C. Cao, J. Cheng, and H. Lu, “EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition,” IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1038-1050, 2018.
[13] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2013, pp. 3551-3558.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017.
[15] B. SravyaPranati, D. Suma, C. ManjuLatha, and S. Putheti, “Large-scale video classification with convolutional neural networks,” in Information and Communication Technology for Intelligent Systems. Singapore: Springer, 2020, pp. 689-695.
[16] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: towards good practices for deep action recognition,” in Computer Vision – ECCV 2016. Cham, Switzerland: Springer, 2016, pp. 20-36.
[17] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 4489-4497.
[18] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 4724-4733.
[19] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018, pp. 318-335.
[20] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 6450-6459.
[21] J. Lin, C. Gan, and S. Han, “TSM: temporal shift module for efficient video understanding,” in Proceedings of the IEEE International Conference on Computer Vision, Seoul, South Korea, 2019, pp. 7082-7092.
[22] O. Kopuklu, A. Gunduz, N. Kose, and G. Rigoll, “Real-time hand gesture detection and classification using convolutional neural networks,” in Proceedings of 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG), Lille, France, 2019, pp. 1-8.
[23] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, et al., “The Kinetics human action video dataset,” 2017 [Online]. Available: https://arxiv.org/abs/1705.06950.
[24] C. Cao, Y. Zhang, Y. Wu, H. Lu, and J. Cheng, “Egocentric gesture recognition using recurrent 3D convolutional neural networks with spatiotemporal transformer modules,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 3783-3791.
[25] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, et al., “Accurate, large minibatch SGD: training ImageNet in 1 hour,” 2017 [Online]. Available: https://arxiv.org/abs/1706.02677.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 770-778.
[27] I. Loshchilov and F. Hutter, “SGDR: stochastic gradient descent with warm restarts,” 2016 [Online]. Available: https://arxiv.org/abs/1608.03983.
[28] Z. Wang, Q. She, and A. Smolic, “Action-Net: multipath excitation for action recognition,” 2021 [Online]. Available: https://arxiv.org/abs/2103.07372.
[29] M. Abavisani, H. R. V. Joze, and V. M. Patel, “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019, pp. 1165-1174.
[30] Q. Ren, “A video expression recognition method based on multi-mode convolution neural network and multiplicative feature fusion,” Journal of Information Processing Systems, vol. 17, no. 3, pp. 556-570, 2021.
[31] G. Zhao, H. Yang, B. Tu, and L. Zhang, “A survey on image emotion recognition,” Journal of Information Processing Systems, vol. 17, no. 6, pp. 1138-1156, 2021.

Zongjing Cao, Yan Li, and Byeong-Seok Shin*, Multi-Scale Feature-Based Spatiotemporal Pyramid Network for Hand Gesture Recognition, Article number: 12:46 (2022)