A Progressive Region-Focused Network for Fine-Grained Human Behavior Recognition
Abstract
Establishing fine-grained human behavior recognition is crucial in human-centric applications. Although modeling human behaviors and activity patterns for recognition or detection has attracted significant research interest recently, implementing fine-grained behavior detection remains challenging. Here, we propose a progressive region-focused network (PRF-Net), a unified framework that interrelates region detection and fine-grained feature learning, implemented by two components working in tandem. To learn a representation that contains more detailed information for better fine-grained recognition, we propose a progressive attention training strategy that learns discriminative region attention and fosters inherent complementary properties across different levels of information. The second principle is to make the model more specialized to the problem: we introduce a region-focused image generator to locate more discriminative local regions, providing a priori knowledge for the next layer. Experiments show that the developed model achieves state-of-the-art performance on two public action recognition benchmark datasets, with an accuracy of 89.57%–94.50% on PPMI-24 and 60.20%–63.36% on DAR-4. Notably, our approach surpasses some models that use auxiliary information, providing another avenue for practical applications. This remarkable learning ability makes the proposed method readily applicable to fine-grained human behavior recognition, making it valuable for human-centric applications, such as surveillance analysis and human-computer interaction.
Keywords
Human Action Recognition, Fine-Grained Behaviors, Machine Learning, Progressive Training Strategy, Human-Computer Interaction
Introduction
Action recognition in still images [1] detects and classifies a person's behavior from a single image, driving interest in human-centric applications such as surveillance analysis and human-computer interaction. Recent advances in deep learning [2] have facilitated human behavior recognition and achieved success in difficult recognition tasks. However, action recognition in still images remains challenging due to highly variable appearance, cluttered backgrounds, and the lack of spatiotemporal details.
Early approaches [3, 4] commonly use contextual information within a region enclosing the focal person. Thus, pre-annotated bounding boxes are essential for identifying regions of interest. In addition, recognition accuracy can be further improved by combining region-based approaches with convolutional neural networks (CNNs). That said, manual annotations are difficult to obtain and often error-prone, leading to performance degradation. Weakly supervised frameworks have consequently captured greater attention, and some excellent solutions have been devised in recent research. The authors of [5–9] used detection models instead of manual annotations as auxiliary information, e.g., detectors for the human body's critical points or body parts. Unfortunately, relying on detectors imposes limitations on the implementation of the model: when the detector output deviates significantly from the expected values, performance suffers greatly, and these limitations of auxiliary information have led some researchers to look elsewhere. Thus, more recent efforts [10–12] use information fusion to compensate for the lack of supporting information.

Fig. 1. Traditional image recognition with distinct differentiation and fine-grained behavior recognition with subtle differentiation: (a) generic image recognition consisting of distinctive categories, (b) fine-grained action of "playing" (top) versus "holding" a violin by distinct people, (c) fine-grained action of "drinking water" (top) or not by distinct people, and (d) the same fine-grained action "with cello" involving different people. The interactions with musical instruments are from the PPMI-24 dataset [13]; the interactions with book and cup are from the DAR-4 dataset [14].
Most approaches adopt sample images with distinct image classes (Fig. 1(a)); however, their performance remains poor in distinguishing various fine-grained behaviors (Fig. 1(b)–1(d)), such as playing versus holding a violin [13, 14]. Typically, fine-grained behaviors are visually similar at a coarse glance: subclass categories invariably share nearly the same global appearance, while intraclass image features vary considerably. A recent study has shown that humans are easily attracted to an image's most prominent local areas, even without additional information, and attending to the details of these local areas enables correct identification. Notably, region detection and fine-grained feature learning are interrelated. For example, Fig. 2 shows that accurate localization of the hands promotes the model to learn discriminative hand features, which further helps pinpoint whether the person in the picture is playing the violin. Therefore, there is a need for methods that generate fine-grained semantic parts at different levels with a progressive attention strategy and make models correctable for fine-grained human behavior recognition tasks.
In this work, we address these issues by incorporating region-focused detection and fine-grained learning directly into the design of the model (Fig. 2). Specifically, we do not perform the pipeline in an end-to-end manner. Instead, we propose a progressive training strategy that works on still images with full spatial details. The resulting approach, a progressive region-focused network (PRF-Net), does not require further auxiliary information to facilitate feature learning. Compared to other studies, the main difference of our work is that PRF-Net performs a multiscale region-focused attention strategy that can reveal fine-scale behavior variation without auxiliary information. Combined with the progressive pipeline, these changes notably improve fine-grained behavior recognition performance.

Fig. 2. Overview of our framework, which consists of a progressive attention training network and several region-focused image generators. The framework runs in different training steps, and the parameters trained in the current step are passed to the next training step to initialize its parameters.
In practice, we developed a unified framework that interrelates region detection and fine-grained feature learning. A major focus is a progressive attention training strategy that combines progressive training with self-attention for feature learning. In this strategy, different training steps are run, and the parameters trained in the current step are passed to the next training step to initialize its parameters; the extracted features are connected in the last step. However, naively applying progressive training is not efficient on its own; it is also necessary to consider how to transfer task-relevant knowledge to the later levels. Therefore, another technological focus is a region-focused image generator, which serves as the carrier of knowledge that associates certain image regions with the target task. We start with the complete image and iteratively generate regions from coarse to fine. The major contributions of this article are summarized as follows:
We propose a progressive attention training strategy that learns discriminative region attention and fosters inherent complementary properties with different levels of information, capable of efficiently mining the intrinsic structures and capturing human behaviors from a comprehensive view.
We propose a region-focused image generator to locate a more discriminative local region, effectively provide prior knowledge for the next layer, and make the model more reliable.
The proposed PRF-Net requires no auxiliary information, and comprehensive experimental results on the PPMI-24 and DAR-4 datasets show significant improvements over state-of-the-art (SotA) approaches.
The rest of the paper is organized as follows. Section 2 briefly reviews previous studies on action recognition, and Section 3 explains our proposed approach. Section 4 provides the implementation and experimental results, and Section 5 presents the conclusions.
Related Work
Action Recognition in Still Images
Despite tremendous advances in human action recognition, action recognition in still images remains challenging. Due to the lack of time-domain information, current approaches use spatial information for the feature representation of human behavior. For instance, human body poses [15], body parts [6], interactions between people and objects [16, 17], and the whole scene [18] are the most commonly used high-level cues in human action recognition.
Early approaches require auxiliary information (e.g., bounding boxes, body key point annotations) during training and testing. Therefore, numerous datasets, e.g., Stanford 40 Actions [19] and PASCAL VOC 2012 Action [20], provide detailed manual annotations. However, annotating bounding boxes is usually labor-intensive and time-consuming. As a result, researchers have gradually turned their attention to weakly supervised methods. The authors of [5] used detectors capturing human-object interactions instead of additional manually created human bounding boxes. The authors of [6] proposed a behavior recognition model incorporating global and local human feature information; they designed two types of human cue information, namely limb angle descriptors and part extractors. The vector of locally aggregated descriptors (VLAD) coding technique combines the deep features of regions in [8], which can identify local and global backgrounds in still images. However, these approaches are limited by their reliance on external assistance, such as the locations of crucial points on the human body or body part detection models. To address this, the authors of [10] developed a static image-based human activity recognition (HAR) model; they used transfer learning to train several well-known networks and then fine-tuned them accordingly. The authors of [11, 21, 22] used ensemble learning techniques to improve the overall accuracy of action classification by combining the predictions of multiple models. Furthermore, some methods combine color information to improve performance. For instance, the authors of [12] combined multiple CNN models and two color spaces (RGB and oRGB), while Khan et al. [23] incorporated color and shape information in the later stages of model training. The authors of [24] proposed a spatial-temporal convolutional neural network (STCNN) and introduced a novel image representation domain (Rank_SM-POF) to capture both the actor's appearance and future movement patterns. Other past methods rely on body structure or sensors. The authors of [25] proposed a human parsing network that captures the compositional relationships in the human body structure. In addition, several works [26, 27] have investigated HAR using smartphone sensors, combining ambient intelligence and machine learning to provide a variety of services, such as healthcare and monitoring of daily activities.
In summary, existing end-to-end trainable human action recognition frameworks lack fine-grained recognition capability: they require behaviors to be precisely differentiable, which is inherently challenging for events in real-world scenarios. Although these approaches have made remarkable progress, fine-grained behavior analysis with still images remains severely constrained. In contrast, we emphasize identifying finer-grained discriminative regions to improve the performance of static image-based human action recognition.
Progressive Strategy and Attention Mechanism
The progressive strategy and the attention mechanism are two important building blocks of our approach. A progressive training strategy [28] first captures the large-scale structure of the image distribution before shifting the focus to progressively finer-scale features, rather than learning information from all scales at once. This is important because our approach can exploit a progressive training strategy to gain full spatial details. The goal of the attention mechanism is to learn the features most relevant to the specific task.
Regarding progressive strategies, the authors of [29] proposed the Laplacian pyramid super-resolution network (LapSRN), which progressively reconstructs the sub-band residuals of high-resolution images. The authors of [30] provided a progressive discriminator along with a progressive training strategy in which the network up-samples an image in intermediate steps, while the learning process is organized from easy to hard, as in curriculum learning. The authors of [31] used a progressive training strategy with a random puzzle generator to encourage the network to learn complementary features at different granularities. Compared to these efforts, we focus on interrelating region detection and fine-grained feature learning using the idea of progressive networks. For example, when facing the task of distinguishing similar images, humans are still easily drawn to the most salient local regions and can classify the images correctly.
The attention mechanism has demonstrated superior performance in machine translation, object detection [32], image captioning [33], interaction recognition [34], and other tasks [35]. Recently, visual attention-based deep models have been developed for action recognition from still images. An attention network is presented in [36] with scene-level and region-level contextual cues and a target person's bounding box. The authors of [7] proposed a behavior recognition method based on body parts and attention mechanisms. Inspired by the self-attention mechanism, the authors of [37] proposed a novel human-object interaction module that computes human-object interactions based on the spatial and appearance features of humans and objects, significantly enhancing the behavioral discrimination of the model. The authors of [38] provided an attention-focused spatial pyramid pooling network (AttSPP-net) free from bounding boxes by jointly integrating the soft attention mechanism and spatial pyramid pooling (SPP) into a CNN.
This work introduces the attention mechanism to make the model learn more relevant features for the specific task. However, attention inevitably increases the number of model parameters. Thus, we design a serialization module that enables the attention module to capture information at a lower cost.
Methodology
To identify human behaviors in still images, it is desired to accurately (1) localize and recognize human actions and (2) predict the action classes. Central to our approach is a modular network called PRF-Net, which gradually localizes discriminative part features through progressive learning without introducing auxiliary information.
Here, we consider the following setup (Fig. 3). In the progressive training procedure, each iteration consists of S+1 steps; we set S = 3 for the explanation. At every iteration, the training data are enhanced by the region-focused image generator (Rfg) and fed into the network. At each step, the output of the corresponding classifier is used for loss calculation and parameter optimization. The serialization module (SL_mod) is designed to reduce the number of parameters and enables the attention module (Att_mod) to capture global information at a lower cost. The circled numbers correspond to the four images used as training data at different steps, the gray rectangles represent frozen stages, and fc represents the fully connected stage.

Fig. 3. Overview of the proposed architecture: (a) input images, (b) progressive attention training strategy, (c) mixture of attention-weighted, and (d) region-focused image generator (Rfg).
Progressive Attention Training Strategy
We introduce a progressive training strategy where low stages are first trained, and new training stages are added. Moreover, at the end of each training step, the parameters trained in the current step are passed to the next training step to initialize its parameters.
Our network is designed to be generic and can be implemented over any state-of-the-art backbone feature extractor, such as ResNet. Given a backbone feature extractor with N stages, we only consider the last S stages, i.e., $l \in \{N-S+1, \ldots, N-1, N\}$. The output feature map from the l-th stage is denoted as

$X^l \in \mathbb{R}^{C_l \times H_l \times W_l}$,

where $C_l$, $H_l$, and $W_l$ refer to the feature map's channel, height, and width, respectively. Here, we aim to cultivate the inherent complementary properties between shallow and deep information. Moreover, inspired by the attention model [35], we apply a vector representation

$V^l = Att\_mod(SL\_mod(X^l))$,

where $SL\_mod(\cdot)$ denotes the serialization module and $Att\_mod(\cdot)$ the attention module. The serialization module is designed to reduce the number of parameters and enables the attention module to capture global information at a lower cost. This is followed by an additional classification module

$y^l = F_{class}^{l}(V^l)$,

where $F_{class}^{l}$ represents the fc stage.
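To make the per-stage serialization and attention head concrete, the following is a minimal PyTorch sketch of one possible instantiation; the 1×1-convolution serialization, the single-head dot-product attention, and the layer sizes are our assumptions, since the paper does not specify the exact design.

```python
import torch
import torch.nn as nn

class StageHead(nn.Module):
    """Per-stage head: serialization (SL_mod) -> self-attention (Att_mod) -> fc classifier.
    The concrete layer choices are illustrative assumptions, not the authors' exact design."""
    def __init__(self, in_channels, num_classes, reduced_channels=256):
        super().__init__()
        # SL_mod: 1x1 conv reduces the channel dimension so attention runs at lower cost
        self.sl_mod = nn.Conv2d(in_channels, reduced_channels, kernel_size=1)
        # Att_mod: single-head scaled dot-product self-attention over spatial positions
        self.q = nn.Linear(reduced_channels, reduced_channels)
        self.k = nn.Linear(reduced_channels, reduced_channels)
        self.v = nn.Linear(reduced_channels, reduced_channels)
        self.classifier = nn.Linear(reduced_channels, num_classes)

    def forward(self, feat):                            # feat: (B, C_l, H_l, W_l)
        x = torch.relu(self.sl_mod(feat))               # (B, C', H_l, W_l)
        tokens = x.flatten(2).transpose(1, 2)           # (B, H_l*W_l, C')
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / (q.size(-1) ** 0.5), dim=-1)
        vec = (attn @ v).mean(dim=1)                    # pooled vector representation V^l
        return vec, self.classifier(vec)                # V^l and the per-stage prediction y^l
```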
For the training of the output features at each step, we adopt the cross-entropy loss $L_{CE}$ between the ground-truth label $y$ and the predicted probability distribution to optimize the model:

$L_{CE}(y, y^l) = -\sum_{j=1}^{m} y_j \log y_j^l$,     (1)

where $m$ is the number of categories. Note that at each iteration, we only train one stage's output at each step. All parameters used in the current prediction are optimized, thus enabling the steps to work together.
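As a concrete illustration of the training loop described above, the following PyTorch-style sketch runs the S per-stage steps plus the final concatenation step of one iteration. The helper names (`backbone` returning the last S stage feature maps, `heads`, `concat_head`, and `rfg` for the region-focused image generator sketched in a later subsection) are hypothetical, and the plain concatenation in the last step omits the loss-based weighting introduced in the next subsection.

```python
import torch
import torch.nn.functional as F

def progressive_iteration(backbone, heads, concat_head, rfg, images, labels, optimizer, S=3):
    # One training iteration = S per-stage steps + 1 concatenation step (a sketch, not the
    # authors' exact code). `backbone(x)` is assumed to return the last S stage feature maps.
    x = images
    for i in range(S):                               # steps 1..S, coarse to fine
        feats = backbone(x)
        vec, logits = heads[i](feats[i])             # SL_mod -> Att_mod -> fc for stage N-S+1+i
        loss = F.cross_entropy(logits, labels)       # Eq. (1): one stage trained per step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        x = rfg(feats[i].detach(), x)                # region-focused images for the next step
    feats = backbone(images)                         # last step: fuse all stage vectors
    vecs = [heads[i](feats[i])[0] for i in range(S)]
    logits = concat_head(torch.cat(vecs, dim=1))     # plain concat; weighted variant shown later
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```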
Mixture of Weighted Attention
In general, to improve the performance of fine-grained human behavior recognition, we concatenate the outputs of the last several stages, which is mathematically denoted as:

$X_{concat} = \mathrm{concat}(V^{N-S+1}, \ldots, V^{N-1}, V^{N})$.     (2)
However, we believe that simple concatenation cannot maximize the benefits. Thus, at the last step of each iteration, our model calculates weight values $w[i]$ based on the ability of each layer of the attention network to capture critical features, and we weight the information for fusion at the end, as shown below:
(3)
(4)
where $i \in \{0, 1, \ldots, S-1\}$, $L_i$ denotes the loss value after the i-th step of progressive network training, and $L = [L_1, L_2, \ldots, L_S]$. This is followed by a classification module $y_{concat} = F_{class}(X_{concat})$. Then, we use the cross-entropy loss to optimize the model:
(5)
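Because Eqs. (3)–(4) are not reproduced here, the following sketch uses an inverse-loss softmax purely as an illustrative placeholder for the weight computation; only the overall weight-then-concatenate structure reflects the text above.

```python
import torch

def loss_weighted_concat(stage_vectors, stage_losses):
    # stage_vectors: list of S tensors of shape (B, C'); stage_losses: list of S floats (L_i)
    # Placeholder weighting: lower per-step loss -> larger weight (not the paper's exact Eqs. 3-4)
    losses = torch.tensor(stage_losses)
    weights = torch.softmax(-losses, dim=0)
    weighted = [w * v for w, v in zip(weights, stage_vectors)]
    return torch.cat(weighted, dim=1)                  # X_concat, fed to the classifier F_class
```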
Region-Focused Image Generator
Our pipeline includes a critical component called the region-focused image generator, which extracts local regions and provides new training data for the progressive network. At every iteration, the training data are enhanced by the region-focused image generator and fed into the network in turn (Fig. 4). The generation process operates in two steps: attention to localization, and region cropping and enlargement.
Fig. 4. Example of the region-focused image generator. After gradually zooming in on the attended regions, we can observe clear and significant visual cues for classification.
Attention to localization (ATL): Given the input images $X^{\alpha} \in \mathbb{R}^{C' \times H' \times W'}$, we first extract features by feeding the images into pre-trained convolutional layers to obtain feature maps $X \in \mathbb{R}^{C \times H \times W}$. To squeeze the global spatial information into a channel descriptor, we use global average pooling to generate channel-wise statistics. Formally, a statistic $s \in \mathbb{R}^{C}$ is produced by shrinking $X = [x_1, x_2, \ldots, x_C]$ over its spatial dimensions $H \times W$, where the c-th element of $s$ is derived as

$s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$.     (6)
We then perform a second operation using the information aggregated in the squeeze operation. To limit model complexity and aid generalization, we form a dimensionality-increasing layer that returns to the channel dimension of the transform output X, followed by a ReLU [39]. The final output of the block $X^{fin}$ is derived as shown below:
(7)
(8)
where $\cdot$ denotes channel-wise multiplication between the scalar $s_c$ and the feature map $u_c \in \mathbb{R}^{H \times W}$, and $\delta$ refers to the ReLU function.
Then, we can locate the index $id_{max} = \arg\max(X^{fin})$ of the most significant point $[x_{max}, y_{max}]$ in the feature map, which is mathematically denoted as:
(9)
where w,h are the width and height of the feature map, respectively.
Region cropping and enlargement (RCE): We then remap the coordinates $[x_{max}, y_{max}]$ of the most discriminative point to the original image coordinates $[x, y]$, and the point $[x, y]$ is used as the center of the new image, which is mathematically denoted as:
(10)
where w',h' are the width and height of the original picture, respectively.
Then, we can utilize a threshold κ to calculate the bounding box for locating the significant region in the images, as mathematically shown below:
(11)
(12)
After the attended regions have been localized, we need to further enlarge the region X' to ensure that each input stays the same size. Specifically, we use bilinear interpolation to recover the initial size of the image, which is mathematically denoted as:
(13)
All batches processed by the region-focused image generator share the same label Y. Finally, the obtained new images (X'',Y) are fed to the next step of the progressive network.
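Below is a minimal sketch of the two-step generator (ATL followed by RCE), with simplifications we have assumed: the saliency map is taken as a plain channel mean rather than the full squeeze-and-excitation weighting, and `kappa` controls the relative crop size in place of the paper's exact threshold rule.

```python
import torch
import torch.nn.functional as F

def region_focused_generator(feature_map, images, kappa=0.35):
    # feature_map: (B, C, H, W) stage activations; images: (B, 3, H', W') current inputs
    B, _, H, W = feature_map.shape
    _, _, Hi, Wi = images.shape
    # ATL: aggregate channels into a saliency map and locate its peak (simplified X^fin)
    saliency = feature_map.mean(dim=1)                          # (B, H, W)
    flat_idx = saliency.flatten(1).argmax(dim=1)                # id_max per image
    y_max = torch.div(flat_idx, W, rounding_mode="floor")
    x_max = flat_idx % W
    # remap the peak coordinates from feature-map resolution to image resolution
    cx = (x_max.float() + 0.5) * Wi / W
    cy = (y_max.float() + 0.5) * Hi / H
    # RCE: crop a kappa-sized box around the peak, then enlarge back with bilinear interpolation
    half_w, half_h = int(kappa * Wi / 2), int(kappa * Hi / 2)
    crops = []
    for b in range(B):
        x0 = max(0, min(int(cx[b]) - half_w, Wi - 2 * half_w))
        y0 = max(0, min(int(cy[b]) - half_h, Hi - 2 * half_h))
        crop = images[b:b + 1, :, y0:y0 + 2 * half_h, x0:x0 + 2 * half_w]
        crops.append(F.interpolate(crop, size=(Hi, Wi), mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)                              # new images X'', same labels Y
```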
Inference
At inference time, only the serialization and attention modules are required. To reduce computational costs, we feed only the original images into the trained model and leave out the region-focused image generator (Fig. 5).
Fig. 5. Illustration of our inference modeling. In this stage, we only feed the picture to the last stage in the network. Since there is no weight value at this point, we directly fuse the feature maps after Att_mod.
In addition, the first three steps can be removed, reducing the computational budget with a negligible impact on the accuracy of our model:
(14)
This is followed by an additional classification module $y_{concat} = F_{class}^{concat}(X_{concat})$, where $F_{class}^{concat}$ represents the fully connected stage, and we obtain the prediction probability distribution $y_{concat}$. In this case, the final result $C_{infer}$ is given by:

$C_{infer} = \arg\max(y_{concat})$.     (15)
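A minimal sketch of the inference path, under the same hypothetical helper names as the earlier training sketches: the region-focused image generator is dropped, the original image is passed once, and the Att_mod outputs are fused directly before the concatenated classifier.

```python
import torch

@torch.no_grad()
def infer(backbone, heads, concat_head, image):
    # image: (B, 3, H', W'); no region-focused generation at test time
    feats = backbone(image)                               # last S stage feature maps
    vecs = [head(f)[0] for head, f in zip(heads, feats)]  # V^l after SL_mod + Att_mod
    y_concat = concat_head(torch.cat(vecs, dim=1))        # fused prediction distribution
    return y_concat.argmax(dim=1)                         # C_infer
```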
Experiment and Results
Datasets
Multiple datasets for human action recognition have been proposed; however, most are too coarse for our task. For instance, when distinguishing playing from holding a violin in the PPMI dataset, neither the background nor the object provides a discriminative cue, and the differences between the actions are very small. In contrast, for kicking a soccer ball versus playing basketball in common datasets, the differences between the actions are large and other discriminative cues exist. Therefore, we performed experiments on the People Playing Musical Instruments (PPMI-24) [13] and Drinking and Reading (DAR-4) [14] datasets and present a detailed overview in Table 1. Both datasets are highly balanced, with almost identical numbers of examples per class; in such a case, the class distribution does not bias the model, avoiding misclassifications due to data imbalance.
Table 1. Experimental datasets
Dataset | Classes | Training | Val | Test
DAR-4 | 4 | 3,728 | 480 | 480
PPMI-24 | 24 | 2,400 | - | 2,400
There is no validation set in the PPMI-24 dataset.
PPMI-24 is one of the benchmark datasets used for action recognition. Its activities fall into two categories according to whether a person is playing (PPMI+) or holding (PPMI–) an instrument. There are 24 interactive actions comprising 4,800 photos, with 100 training and 100 test examples supplied for each action (Fig. 6).
Fig.6. Some sample images from the PPMI-24 dataset.
The DAR-4 dataset focuses on two binary tasks: drinking water (liquid in mouth) or not, and reading (gaze toward text) or not. There are 2,164 images for drinking and 2,524 for reading, each with 50% "yes" labels. Each person appears in only one split; for example, if a person is in the training set, they appear in neither the validation nor the test set (Fig. 7).
Fig. 7. Some sample images from the DAR-4 dataset.
Experimental Settings
The effectiveness of the proposed approach is evaluated through various comparisons.
Evaluation metrics: Following the standard protocol [13, 14] for the two datasets, we report the accuracy (Acc) and mean average precision (mAP) to evaluate the quantitative results. In addition, the precision-recall (P-R) curve and receiver operating characteristic (ROC) curve are used because they are well suited to evaluating fine-grained detectors by plotting a curve for each object class as the confidence threshold varies. Furthermore, we present qualitative comparisons with visualizations and compare the parameter counts and average inference speeds of the models.
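For reference, the reported metrics can be computed as in the following sketch (our own illustration using scikit-learn, not code from the paper); per-class P-R and ROC curves are obtained from the same scores with `precision_recall_curve` and `roc_curve`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def evaluate(y_true, y_score):
    # y_true: (N,) integer labels; y_score: (N, K) class probabilities.
    # Accuracy uses the arg-max prediction; mAP averages one-vs-rest AP over the K classes.
    y_pred = y_score.argmax(axis=1)
    acc = accuracy_score(y_true, y_pred)
    aps = [average_precision_score((y_true == k).astype(int), y_score[:, k])
           for k in range(y_score.shape[1])]
    return acc, float(np.mean(aps))
```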
Implementation details: All experiments are implemented in PyTorch 1.7 on NVIDIA GeForce RTX 3080 GPUs. We adopt ResNet-50 [40] pre-trained on ImageNet [41] as the backbone of our model, which gives a total number of stages N = 5. We do not use auxiliary information in our experiments, and the category labels of the images are the only annotations used for training. In the training phase, we resize images to 260×260 and then randomly crop them to 256×256 with random horizontal flipping. During testing, the input images are resized to 260×260 and randomly cropped to 256×256.
With the stochastic gradient descent (SGD) optimizer, the learning rate is initialized to 0.002 and reduced following a cosine annealing schedule during training. In addition, we train on all datasets for up to 100 epochs with a batch size of 32, a weight decay of 0.0005, and a momentum of 0.9.
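The following is a minimal sketch of this configuration in PyTorch/torchvision; the exact transform composition and scheduler wiring are our assumptions, while the hyperparameter values are those stated above.

```python
from torch import optim
from torchvision import transforms

# Training-time preprocessing: resize to 260x260, random crop to 256x256, random horizontal flip
train_transform = transforms.Compose([
    transforms.Resize((260, 260)),
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def build_optimizer(model, epochs=100):
    # SGD with lr=0.002, momentum=0.9, weight decay=0.0005, cosine annealing over training
    optimizer = optim.SGD(model.parameters(), lr=0.002, momentum=0.9, weight_decay=0.0005)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```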
Quantitative Results
To evaluate the fine-grained recognition ability of our approach, we compare PRF-Net with state-of-the-art baselines on two fine-grained human behavior recognition datasets (PPMI-24 and DAR-4). We use the standard metrics of accuracy and mAP to evaluate the quantitative results, which are presented as percentages (%). Since we do not use additional annotations, we mostly compare our results to methods without auxiliary information.
Comparisons with state-of-the-art on the PPMI-24 dataset
Table 2 compares our approach with standard methods and with methods that use auxiliary information (VLAD, R-FCN), and the confusion matrix for the best model is presented in Fig. 8. These results demonstrate that our method achieves the best accuracy on the PPMI-24 dataset and remains the best on the PPMI+ and PPMI– subsets. Moreover, we evaluated statistical performance with the kappa score and p-value (accuracy > no information rate [NIR]): kappa = 0.89 and p < 0.01, further showing that the high accuracy on PPMI-24 is not attributable to the class distribution.
Standard methods: Our method outperforms the best SotA [10] by 15.57%, achieving the best overall performance compared to other methods. Although Zhao et al. [42] achieved 92.90% for the 12-class binary classification with the generalized symmetric pair model (GSPM) (1.6% lower than our result on the play-instrument subset), their 24-class classification achieved only 51.70%. Additionally, our method significantly outperforms the color fusion model and the deep ensemble learning based on the voting strategy (DELVS) model [11, 12], while the mAP value of our model is at least 3.58% better than that of the other models.
Using auxiliary information: The part detector-based VLAD [8] method attained 81.30% on PPMI-24, while R-FCN [5] achieved 83.40% based on action detection. Our method performs 8.27% and 6.17% better than VLAD and R-FCN, respectively, demonstrating that it outperforms even methods with auxiliary information.
Table 2. Comparison of different approaches on the PPMI-24 dataset (unit: %)
Method | Year | Play instrument | With instrument | PPMI-24 accuracy | PPMI-24 mAP
GSPM [42] | 2017 | 92.9 | 89.3 | 51.7 | -
Color fusion [12] | 2020 | 74.31 | 65.39 | 65.85 | -
DELVS [11] | 2020 | 77.25 | 70.1 | 69.54 | 74.71
TL HAR [10] | 2021 | 85.03 | 74.28 | 74 | 89.64
VLAD [8] | 2017 | - | - | 81.3 | -
R-FCN [5] | 2019 | - | - | 83.4 | -
Proposed | 2022 | 94.5 | 92.03 | 89.57 | 93.22
Fig. 8. Confusion matrix for the PRF-Net on the PPMI-24 dataset. Each row represents the ground truth label, while the column shows the label obtained by PRF-Net.
Comparisons with state-of-the-art on the DAR-4 dataset
Table 3 summarizes the performance of PRF-Net on the DAR-4 dataset, and the confusion matrix is shown in Fig. 9. In addition, we evaluated statistical performance with the kappa score and p-value (accuracy > NIR): kappa = 0.50 and p < 0.01, showing that the accuracy is not attributable to the class distribution on the DAR-4 dataset.
Our PRF-Net achieves 60.20% on the DAR-4 dataset, which is significantly higher than all the SotA approaches shown in Table 3. The accuracy of PRF-Net outperforms the best SotA [10] by 1.33%. Detectron V1 and Detectron V2 [14] achieved 56.29% and 53.08% accuracy, respectively, at least 3.91% lower than ours; they applied algorithms that impose the definition of each action by detecting the corresponding elements and their interactions. Our method also outperforms DELVS [11] by a significant margin. These results demonstrate that our model performs well on the binary classification tasks (drinking and reading). In addition, the mAP value of our model is at least 1% better than that of the other models.
Table 3. Comparison of different approaches on the DAR-4 dataset (unit: %)
Method | Year | Drinking | Reading | DAR-4 accuracy | DAR-4 mAP
Detectron V1 [14] | 2020 | 52.9 | 62.8 | 56.29 | -
Detectron V2 [14] | 2020 | 57.3 | 56.1 | 53.08 | -
DELVS [11] | 2020 | 57.44 | 58.63 | 54.01 | 54.65
TL HAR [10] | 2021 | 59.77 | 60.95 | 58.87 | 61.25
Proposed | 2022 | 63.36 | 64.08 | 60.2 | 62.46
Fig. 9. Confusion matrix for PRF-Net on the DAR-4 dataset. Each row represents the ground truth label, while the column shows the label obtained by PRF-Net.
Qualitative Results
In addition to the quantitative evaluation, we present a visual comparison to explore whether our approach focuses more precisely on the target regions. Fig. 10 depicts some qualitative comparison results on the PPMI dataset. We observe that our approach produces more accurate parsing results than other competitors [10, 11]. At the fine-grained level, our model better captures subtle cues, such as the hand holding the violin in Fig. 10(b), which our model captures fully. Moreover, our model eliminates interference from the background.
Fig. 10. Visual comparison on the PPMI dataset: (a) play recorder and (b) play violin. Our model produces more accurate predictions than other methods [10, 11].
Performance Evaluation
To address the limits of accuracy metrics, we further evaluated our model using the P-R curve and ROC curve (Figs. 11 and 12), both of which show that our model outperforms the other models.
4.5.1 P-R curves
The proposed method's P-R curves are shown in Fig. 11. In Fig. 11(a), our method's average precision (AP) values outperform the other methods in all categories. In Fig. 11(b), the P-R curves of our model and the TL HAR method intersect, so we further compute the F1-score: our model achieves 0.60 versus 0.58 for TL HAR, confirming the improvement on the DAR-4 dataset compared to other methods.
Fig. 11. Precision-recall curves analysis: (a) PPMI dataset and (b) DAR-4 dataset. The green curve indicates the result based on the TL HAR method [10], the orange curve indicates the result based on the DELVS method [11], and the blue indicates the result of our method.
Fig. 12. Receiver operating characteristic curve analysis: (a) PPMI dataset and (b) DAR-4 dataset. The orange curve indicates the result based on the TL HAR method [10], the green curve indicates the result based on the DELVS method [11], and the blue indicates the result of our method.
Runtime Comparison
In real-world applications, model parameters and running time should be minimized. Table 4 reports the model parameters and average inference speed of various methods, averaged over the two datasets. Three state-of-the-art methods, DELVS [11], Detectron V1 [14], and TL HAR [10], were selected for comparison, keeping all hardware environments consistent. The experimental results show that the average inference time of our PRF-Net is only 5.6 ms, an approximately 8x speedup over the methods of [11, 14] and nearly three times faster than [10].
Table 4. Comparison of different approaches on model parameters and average inference speed
Method | Year | Parameters (MB) | Runtime (ms)
DELVS [11] | 2020 | 318 | 42.2
Detectron V1 [14] | 2020 | 282 | 40.5
TL HAR [10] | 2021 | 233.6 | 16.7
Proposed | 2022 | 198.4 | 5.6
Ablation Study
To demonstrate how each component contributes to the performance, we conducted a comprehensive ablation study on the PPMI-24 and DAR-4 datasets to analyze the contributions of the different components of the proposed PRF-Net. The training parameters and evaluation procedures are the same as those described in Section 4.2. Table 5 shows that our components, particularly self-attention and progressive training, are critical to improving performance. For example, the ResNet-50 baseline achieves an accuracy of 75.33% on the PPMI-24 dataset, which improves to 86.21% with progressive attention training. Adding the region-focused image generator further improves the accuracy to 88.03%. Finally, the mixture of attention weights in the last step improves the accuracy to 89.57%.
Table 5. Accuracy (%) of PRF-Net with the addition of main components
Components | PPMI-24 | DAR-4
Base ResNet-50 | 75.33 | 53.95
+Progressive attention training | 86.21 | 57.43
+Region-focused image generator | 88.03 | 59.54
+Mixture of attention-weighted | 89.57 | 60.2
Failure Analysis
To provide deeper insight into our method, we present two representative failure cases. As shown in Fig. 13, all the models face difficulties in scenarios where the action or action-related objects are obscured.
These results indicate a limitation of focusing on the fine-grained semantic parts of the human body: our method assumes that the region where the human body produces the action is observable, and such cases still require a deeper semantic understanding of each entity instance.
Discussion
Our research has studied progressive learning and multiscale attention, drawing inspiration from human cognitive mechanisms when designing the learning and recognition architecture. Experiments suggest that our approach promotes fine-grained feature learning without any auxiliary information. In addition, task-oriented optimization boosts performance and exhibits great potential for training fine-grained classifiers in deep neural networks. Collectively, we found that a region-focused image generator combined with a progressive attention strategy enhances performance compared to passive experience, and that fine-grained semantic parts are an avenue worth exploring. However, our approach may produce inferior results for fine-grained human behavior recognition at very small scales. In addition, the failure analysis highlights some interesting directions for future efforts.

Fig. 13. Visualizations of typical failure cases on the PPMI test set: (a) play cello and (b) play erhu.
Conclusion
In this work, we propose PRF-Net for fine-grained human behavior recognition. It is a unified framework that interrelates region detection and fine-grained feature learning, implemented by two components working in tandem: (1) a progressive attention training strategy that learns discriminative region attention and fosters inherent complementary properties between different levels of information, and (2) a region-focused image generator that locates more discriminative local regions. The proposed framework is trained without using any auxiliary information other than category labels. We obtained outstanding accuracy that outperforms the state of the art in all accuracy measures. It is worth mentioning that we even surpassed some models that use auxiliary information, providing another avenue for practical applications. As a learning-based approach, PRF-Net balances computational complexity and shows the advantage of generating region-focused predictions compared to traditional approaches. Moreover, our approach provides the added benefit of allowing localization at an individual level, thus predicting which individuals perform particular actions. This is an advantage for future applications such as monitoring wild animals.
With that said, our approach is limited by the size of available fine-grained behavior datasets, so future studies should develop a new large-scale dataset for fine-grained human behavior recognition. Moreover, there are many avenues in which our proposed method might be improved or extended. For instance, failure cases often occur when human actions occupy very small scales or appear in complex scenarios (refer to Section 4.8). In future work, it would be most interesting to fully use human-object interactions at the scene level, enabling more comprehensive action recognition.
Author’s Contributions
Writing of the original draft, MG. Investigation and methodology, MG, JF. Software, MG. Data curation, JF. Writing of the review and editing, JF.
Funding
This work is supported by the National Natural Science Foundation of China (No. 41971365).
Competing Interests
The authors declare that they have no competing interests.
Author Biography

Name: Jiangfan Feng
Affiliation: College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China
Biography: Jiangfan Feng received his Ph.D. in Cartography and Geographic Information Systems from Nanjing Normal University, Nanjing, China, in 2007. He is currently a professor at Chongqing University of Posts and Telecommunications. His research interests include video GIS and computer vision.

Name: Mengjie Gou
Affiliation: College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China
Biography: Mengjie Gou is currently pursuing the M.S. degree in the Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China. Her research interests include intelligent surveillance systems and computer vision.
References
[1] D. Girish, V. Singh, and A. Ralescu, "Understanding action recognition in still images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, 2020, pp. 1523-1529.
[2] M. Pak and S. Kim, "A review of deep learning in image recognition," in Proceedings of 2017 4th International Conference on Computer Applications and Information Processing Technology (CAIPT), Kuta Bali, Indonesia, 2017, pp. 1-3.
[3] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 1717-1724.
[4] G. Gkioxari, R. Girshick, and J. Malik, "Contextual action recognition with R*CNN," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1080-1088.
[5] F. G. O. Barbosa and M. R. Stemmer, "Action recognition in still images based on R-FCN detector," 2019 [Online]. Available: https://doi.org/10.17648/sbai-2019-111140.
[7] B. Bhandari, G. Lee, and J. Cho, "Body-part-aware and multitask-aware single-image-based action recognition," Applied Sciences, vol. 10, no. 4, article no. 1531, 2020. https://doi.org/10.3390/app10041531
[8] S. Yan, J. S. Smith, and B. Zhang, "Action recognition from still images based on deep VLAD spatial pyramids," Signal Processing: Image Communication, vol. 54, pp. 118-129, 2017.
[9] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019, pp. 5693-5703.
[10] S. Chakraborty, R. Mondal, P. K. Singh, R. Sarkar, and D. Bhattacharjee, "Transfer learning with fine tuning for human action recognition from still images," Multimedia Tools and Applications, vol. 80, pp. 20547-20578, 2021.
[11] X. Yu, Z. Zhang, L. Wu, W. Pang, H. Chen, Z. Yu, and B. Li, "Deep ensemble learning for human action recognition in still images," Complexity, vol. 2020, article no. 9428612, 2020. https://doi.org/10.1155/2020/9428612
[12] Y. Lavinia, H. Vo, and A. Verma, "New colour fusion deep learning model for large-scale action recognition," International Journal of Computational Vision and Robotics, vol. 10, no. 1, pp. 41-60, 2020.
[13] B. Yao and L. Fei-Fei, "Grouplet: a structured image representation for recognizing human and object interactions," in Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, 2010, pp. 9-16.
[14] V. Jacquot, Z. Ying, and G. Kreiman, "Can deep learning recognize subtle human activities?," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 14232-14241.
[15] S. Pratt, M. Yatskar, L. Weihs, A. Farhadi, and A. Kembhavi, "Grounded situation recognition," in Computer Vision–ECCV 2020. Cham, Switzerland: Springer, 2020, pp. 314-332.
[16] Y. Zheng, X. Zheng, X. Lu, and S. Wu, "Spatial attention based visual semantic learning for action recognition in still images," Neurocomputing, vol. 413, pp. 383-396, 2020.
[17] T. Wang, T. Yang, M. Danelljan, F. S. Khan, X. Zhang, and J. Sun, "Learning human-object interaction detection using interaction points," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 4115-4124.
[18] M. Xin, S. Wang, and J. Cheng, "Entanglement loss for context-based still image action recognition," in Proceedings of 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 2019, pp. 1042-1047.
[19] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in Proceedings of 2011 International Conference on Computer Vision, Barcelona, Spain, 2011, pp. 1331-1338.
[20] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, pp. 303-308, 2010.
[21] S. Mohammadi, S. G. Majelan, and S. B. Shokouhi, "Ensembles of deep neural networks for action recognition in still images," in Proceedings of 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 2019, pp. 315-318.
[22] H. A. Dehkordi, A. S. Nezhad, S. S. Ashrafi, and S. B. Shokouhi, "Still image action recognition using ensemble learning," in Proceedings of 2021 7th International Conference on Web Research (ICWR), Tehran, Iran, 2021, pp. 125-129.
[23] F. S. Khan, R. Muhammad Anwer, J. Van De Weijer, A. D. Bagdanov, A. M. Lopez, and M. Felsberg, "Coloring action recognition in still images," International Journal of Computer Vision, vol. 105, pp. 205-221, 2013.
[24] M. Safaei and H. Foroosh, "Still image action recognition by predicting spatial-temporal pixel evolution," in Proceedings of 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, 2019, pp. 111-120.
[25] W. Wang, T. Zhou, S. Qi, J. Shen, and S. C. Zhu, "Hierarchical human semantic parsing with comprehensive part-relation modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3508-3522, 2021.
[26] A. R. Javed, R. Faheem, M. Asim, T. Baker, and M. O. Beg, "A smartphone sensors-based personalized human activity recognition system for sustainable smart cities," Sustainable Cities and Society, vol. 71, article no. 102970, 2021. https://doi.org/10.1016/j.scs.2021.102970
[27] A. R. Javed, M. U. Sarwar, M. O. Beg, M. Asim, T. Baker, and H. Tawfik, "A collaborative healthcare framework for shared healthcare plan with ambient intelligence," Human-centric Computing and Information Sciences, vol. 10, article no. 40, 2020. https://doi.org/10.1186/s13673-020-00245-7
[28] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," in Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
[29] W. S. Lai, J. B. Huang, N. Ahuja, and M. H. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 5835-5843.
[30] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers, "A fully progressive approach to single-image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, 2018, pp. 864-873.
[31] R. Du, D. Chang, A. K. Bhunia, J. Xie, Z. Ma, Y. Z. Song, and J. Guo, "Fine-grained visual classification via progressive multi-granularity training of jigsaw patches," in Computer Vision–ECCV 2020. Cham, Switzerland: Springer, 2020, pp. 153-168.
[32] Z. Zhang, Z. Lin, J. Xu, W. D. Jin, S. P. Lu, and D. P. Fan, "Bilateral attention network for RGB-D salient object detection," IEEE Transactions on Image Processing, vol. 30, pp. 1949-1961, 2021.
[33] C. Yan, Y. Hao, L. Li, J. Yin, A. Liu, Z. Mao, Z. Chen, and X. Gao, "Task-adaptive attention for image captioning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 43-51, 2021.
[34] T. Wang, R. M. Anwer, M. H. Khan, F. S. Khan, Y. Pang, L. Shao, and J. Laaksonen, "Deep contextual attention for human-object interaction detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 2019, pp. 5693-5701.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, pp. 5998-6008, 2017.
[36] S. Yan, J. S. Smith, W. Lu, and B. Zhang, "Multibranch attention networks for action recognition in still images," IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 4, pp. 1116-1125, 2018.
[37] W. Ma and S. Liang, "Human-object relation network for action recognition in still images," in Proceedings of 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 2020, pp. 1-6.
[38] W. Feng, X. Zhang, X. Huang, and Z. Luo, "Attention focused spatial pyramid pooling for boxless action recognition in still images," in Artificial Neural Networks and Machine Learning–ICANN 2017. Cham, Switzerland: Springer, 2017, pp. 574-581.
[39] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 2010, pp. 807-814.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 770-778.
[41] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 248-255.
[42] Z. Zhao, H. Ma, and X. Chen, "Generalized symmetric pair model for action classification in still images," Pattern Recognition, vol. 64, pp. 347-360, 2017.