Vision-based Skeleton Motion Phase to Evaluate Working Behavior: Case Study of Ladder Climbing Safety
Zaili Chen1,2, Li Wu1,*, Huagang He1, Zeyu Jiao2, and Liangsheng Wu2

Human-centric Computing and Information Sciences volume 12, Article number: 01 (2022)
https://doi.org/10.22967/HCIS.2022.12.001

Abstract

Since workers’ unsafe behavior is one of the major risks related to construction accidents and injuries, behavior management plays an important role in enhancing construction safety. Rapidly developing computer vision approaches have been utilized to determine unsafe behaviors in construction. Nevertheless, evaluating actions from the perspective of construction regulations poses a significant research challenge due to the complexity of the spatio-temporal features of movement. In an effort to provide an automated and robust methodology for analyzing working behavior, this paper proposes a framework of vision-based skeleton motion phase features for evaluating the normative conformity of behavior. The developed framework is used to: efficiently obtain human skeletons from imagery data based on convolutional neural network (CNN) models; automatically extract the motion phase feature of each 2D skeleton; and evaluate behavior by the sequence characteristics of a synthesis of bone movements (e.g., limb movements). To validate our approach, a case study of ladder climbing is undertaken to distinguish between three typical climbing postures based on climbing video data collected in the laboratory. The results reveal that the proposed framework can potentially achieve promising performance at detecting safe/unsafe actions by evaluating regular/irregular movements of workers.

Keywords

Safety Management, Video Surveillance, Behavioral Analysis, Motion Phase, Deep Learning, Convolutional Neural Network

Introduction

Workers’ unsafe behavior on construction sites is one of the major risks related to construction accidents and injuries. According to Heinrich [1], approximately 88% of all accidents that occur during construction materialize as a consequence of unsafe behavior. Preventing fall-from-height (FFH) injuries is a major challenge in construction, and falls from ladders occur particularly frequently. Following the work specifications for climbing ladders can effectively reduce the risk of FFH [2, 3]. To ensure the success of a construction project from a safety perspective, it is critical to manage employees so that they perform their work in a safe manner. Hence, worker behavior management plays an important role in enhancing construction safety.
A number of methods [4] have been developed to automatically recognize the actions of individual people from images or videos. A detected action is assessed according to whether it belongs to the category of unsafe actions. However, the complicated and dynamic nature of worker actions in construction makes the definition and evaluation of unsafe behaviors challenging. In reality, the same action interacting with different objects in different scenes leads to different safety assessment results. Furthermore, the risk and safety of a worker’s behavior is related not only to the work object but also to how normatively the action is performed. So far, few studies have focused on the normative evaluation of working behavior based on computer vision and deep learning techniques.
The scope of this paper focuses on automated motion analysis to evaluate the unsafe climbing behaviors of construction workers. Against this backdrop, we develop a computer vision-based approach that can be deployed to assess and evaluate workers’ behavior from a normative perspective. Video cameras, widely adopted for surveillance in construction, are used to collect data. Then, a deep learning method combining a pose estimation module with an object detection [5] model is applied to obtain the motion data of the human skeleton. By introducing joint location information, a vision-based skeleton motion phase feature can be extracted robustly to align the motion sequences in our work. The motion phase [6] represents the time-series progression of bone movement and has long been applied in computer graphics and animation. Based on this enriched representation of actions, a human skeleton-based behavior assessment is achieved that treats actions as a series of asynchronous movements of the human body, where different body parts move at different and continuously changing frequencies/phase shifts, allowing more complex behavior analysis.
The rest of this paper is organized as follows. Related works are reviewed in Section 2, and the proposed approach is described in detail in Section 3. Section 4 presents the experiments conducted as a case study of ladder climbing and illustrates the performance evaluation. Section 5 discusses the contributions and limitations of our work. Finally, Section 6 concludes the paper by summarizing its content and outlining future work. To conclude this section, the contributions of this paper can be summarized as follows:

A novel framework that can efficiently and effectively assess and evaluate complex working behaviors in construction, such as ladder climbing.

A convolutional neural network (CNN)-based model to estimate the skeletal structured data, where a human-object-overlapping module is adopted to locate the region-of-interest (ROI) target for construction safety management (see Section 3.1).

A scheme to automatically and robustly extract motion phase variables that functionally describe the dynamic asynchronous movements of each bone (see Section 3.2).

An experimental and safety training platform for this framework is built, and the evaluation is carried out through a case study of ladder climbing.

Related Work

Human Behavior Management and Observation
Unsafe acts of people and unsafe mechanical or physical conditions are the identified causal factors according to Heinrich’s domino model of accident causation. Research shows that behavioral safety analysis has a significant effect on risk management. Traditional behavior measurement and evaluation methods have been predominantly based on observational methods, which can provide intuitive information for feedback and correction [7]. With a prefabricated checklist of safety-related behaviors, workers learn about the unsafe actions that result in accidents, and watching and discussing their own and colleagues’ behavior can help them significantly improve their working behavior [8]. However, these methods are subjective in nature and rely on the observer’s individual capabilities (e.g., a safety manager’s expertise or experience). Moreover, safety managers are challenged to continuously observe and identify worker behaviors that may lead to accidents under the dynamic and complex work conditions on construction sites [9]. The time-consuming and labor-intensive limitations of such methods have led to an urgent need for cost-effective and reliable methods to measure workers’ behavior on construction sites. To cope with these issues, over the past decades, researchers have explored various methodologies to improve construction safety action management, especially sensor-based technologies that make it possible to quantitatively assess individual workers’ safety performance [10]. Some studies establish zone-based safety risk models to understand the behavior of workers based on location information, where the tracking sensor systems typically use Bluetooth low energy (BLE), radio frequency identification (RFID), the global positioning system (GPS), ultra-wideband (UWB), or a combination of them [11–13].
On the other hand, the rapid development of motion capture systems has encouraged researchers to turn to direct human behavior recognition [14]. A marker-based motion capture system like the VICON-460 can provide high-accuracy, high-frequency sensing data for tasks such as workers’ postural stability assessment and gait analysis. Non-contact sensor-based methods can collect data in a non-invasive way with relatively low accuracy that is nonetheless sufficient to recognize human actions [15]. For example, Han and Lee [16] collected motion data with stereo cameras to reconstruct a 3D skeleton model and used pattern recognition to identify common unsafe actions.
Though a marker-less vision-based motion capture system provides less accurate results than marker-based systems like VICON or OptiTrack, it does not require sensors or markers to be attached to the human body for motion tracking; such attachments are difficult to apply in practice at construction sites because the attached devices affect the movement of workers. To this end, the computer vision-based approach, which can monitor construction sites non-invasively, automatically, and continuously, has emerged as a trend in construction safety research [17, 18].

Vision-based Human Action Recognition
Human action recognition (HAR) has been studied for decades in both the computer vision and construction domains [19]. According to whether the detection data are based on the joint points of human bones, methods can be divided into two types: “video-based” [20] and “skeleton-based” [21].

Video-based HAR
Conventional computer vision methods for human action recognition rely on extracting handcrafted features, which requires two steps: image representation and action classification. Since the main purpose of human action recognition in the construction field is to conduct safety assessment and risk prevention, research efforts have adopted and further developed diverse approaches to analyze the features extracted from images, recognize the human action, and then perform safety assessment, such as: (1) direct classification (e.g., support vector machine classifiers), (2) temporal state-space models (e.g., hidden Markov models), (3) conditional random fields, and (4) detection-based methods (e.g., bag-of-feature-words). Such methods depend tightly on prior information about the construction site and on handcrafted feature extraction. Recently, the rapid development of deep learning algorithms has attracted attention for various tasks. Deep learning methods incorporating CNNs have demonstrated the potential to be effective for computer vision and pattern recognition, which makes them suitable for action recognition-based safety management tasks in construction. For example, Ding et al. [22] proposed an unsafe action recognition model that integrates a CNN and long short-term memory (LSTM), and then used a density-based spatial clustering algorithm to determine dynamic workspaces based on the human action data. Fang et al. [23] presented a Faster R-CNN model to detect whether workers exhibit the unsafe behavior of not wearing their harness, and then developed a Mask R-CNN model that can accurately detect workers’ behavior in a specific scene where they traverse concrete/steel supports during construction.

Skeleton-based HAR
Unlike video-based action recognition methods that rely on image features, which may produce unreliable results sensitive to viewpoint, noise, and occlusion, joint locations and joint angles can enrich the representation of actions by producing concise, robust, and view-independent feature information [24]. Many methods based on handcrafted skeleton features have been introduced for human action recognition. For example, an easy-to-implement descriptor based on the similarity of joint angles was proposed [25], which produced a relatively small feature set and was suitable for real-time applications. Recently, several action recognition methods based on deep learning have been proposed, in which CNNs or recurrent neural networks (RNNs) are usually used to capture spatio-temporal features in 3D skeleton sequences. For instance, Wang et al. [26] proposed joint trajectory map (JTM)-based CNNs to encode the spatio-temporal information carried in 3D skeleton sequences into multiple 2D images, but this method cannot distinguish some actions with similar motion trails due to trajectory overlapping and the loss of past temporal information. Zhang et al. [27] presented an RNN model with an LSTM architecture for end-to-end view-adaptive human action recognition from skeleton data. However, the ability of RNNs to extract useful information from multiple features is weak, which limits their performance. In order to capture as much spatio-temporal information as possible, Li et al. [28] combined LSTM and CNN to conduct effective recognition with multi-channel feature score fusion, which imposes higher computational requirements.
Though both conventional methods and deep learning approaches for human action recognition have shown promising results and provided valuable insight for construction, their high computational cost can lead to poor performance in practical applications. In addition, previous work encounters difficulties in more challenging situations, such as more complex actions or assessing whether actions conform to standards. Abnormal behavior may lead to construction accidents and injuries under certain circumstances, and the intra-class compactness of working behaviors with similar movements makes it difficult to distinguish normal behavior from abnormal behavior. This has inspired the idea of evaluating working behavior through robust motion characteristics of 2D human skeletons at a lower computational cost. However, this raises a new problem of collecting skeleton information from image data, and how to deal with the dynamic asynchronous movements remains an ongoing challenge.

Of all construction accidents, falls from height are the leading cause of serious injuries (48%) and fatalities (30%); notably, falls from ladders account for 17% [29]. A regulatory training standard for construction workers can be effective in preventing the incidence of FFH injuries [30]. Since ladders are an important tool for working at heights, a number of methods for determining unsafe behaviors have already been explored. Liu et al. [31] investigated a tracking approach for 3D human skeleton extraction from stereo video streams for construction workers’ safety and ergonomics. A case study of ladder climbing [32] was conducted using the Kinect to detect specific unsafe actions based on motion classification. Normal climbing, climbing with an object, backward-facing climbing, and reaching far have been identified as four climbing actions for ladders in the extant literature [33], and the detection results achieved good performance.
As the aforementioned methods indicate, either depth sensors or stereo cameras are required to obtain 3D human pose information, which can hardly be implemented at construction sites due to time-consuming pre-installation and high computational consumption. With a large number of RGB cameras installed on construction sites for surveillance purposes, monocular motion analysis methodologies have increasing potential to observe and evaluate worker behaviors for the prevention of accidents and injuries in construction.
To overcome these gaps, this study aims at (1) developing a CNN-based model to efficiently obtain 2D human skeletons from RGB image data in the video stream, (2) extracting the motion phase feature automatically and robustly, (3) proposing an evaluation framework based on this motion feature to identify and assess working behavior in construction, and (4) establishing an experimental and safety climbing training platform for a case study of ladder climbing to demonstrate the applicability of the proposed framework.

Methods

A new two-stage vision-based method has been developed to identify workers’ normal/abnormal activities in ladder climbing using the skeleton motion phase. Fig. 1 illustrates the overall framework. Stage 1 focuses on human 2D pose estimation: the 2D skeleton data for the ROI in which a human is climbing the ladder are represented numerically according to the human-object-overlapping results. Stage 2 constructs a motion phase feature extraction module, which is used to obtain the periodic representation of limb movement. This feature can functionally describe the dynamic asynchronous movements of each bone, which helps to evaluate the climbing activity.

Fig. 1. Overall framework of the proposed method.

2D Pose Estimation
The complexity of construction activities like ladder climbing and the influence of the viewpoint may pose difficulties for accurate human pose estimation. Inspired by the idea of decoupling a complex problem into several independent sub-problems, we first introduce a YOLO-based human-ladder detection module to automatically locate the ROI based on the human-object-overlapping results; then a classical stacked hourglass-based module, combined with a spatial transformer network to eliminate the effects of changing perspectives, is adopted for single-person pose estimation. Fig. 2 illustrates the pipeline implementing the 2D pose estimation framework.

Fig. 2. The architecture of 2D human skeleton pose estimation.

Human-object-overlapping detection
YOLO is one of the most representative object detection models and, after continuous iterative development, achieves a good balance between detection accuracy and real-time performance in the YOLO-V5 version. The state-of-the-art YOLO-V5 [34] is applied as the base model to detect people and various ladder targets; combined with a human-object-overlapping detection module, we obtain the ROI of the human-ladder pair with high accuracy at reasonable computational complexity.
The YOLO-V5 model predicts bounding box coordinates and object class probabilities simultaneously, end-to-end, so people and ladder objects are detected in the video stream frame by frame. At a construction site, workers in potentially hazardous conditions are the issue of greatest concern. Hence, we propose an intuitive but effective approach to determine whether workers are at potential risk by calculating the overlap rate between workers and hazards. The recognition of objects in our work (e.g., workers and kinds of ladders) makes it possible to divide the detected objects into two groups based on prior knowledge of the construction site: bounding boxes labeled “people” are classified into the “worker” list Λ, and bounding boxes with labels such as “straight ladder” are classified into the “ladder object” list Γ. Each element of Λ is paired with each element of Γ in sequence, guaranteeing that every potential combination is examined when determining our observation targets. To achieve this, we introduce the intersection over union (IoU) between the bounding box of the ith worker and the bounding box of the jth object to determine whether the selected worker-object pair should be inspected. The decision process $S_{ij}$ is formulated as:

$S_{ij}=\begin{cases}1, & |\Lambda_i \cap \Gamma_j| \,/\, |\Lambda_i \cup \Gamma_j| > \varepsilon \\ 0, & \text{otherwise}\end{cases}$    (1)

where $Λ_i$ denotes the area of the ith worker bounding box and $Γ_j$ denotes the area of the jth risk object bounding box. With a threshold ε, we can select the worker-object pairs to inspect, which improves the efficiency of construction site monitoring.
Though the data captured by a monocular camera are 2D and the 3D spatial geometric relations are therefore difficult to characterize, by setting an appropriate camera perspective on-site, sufficient information can still be obtained to achieve the preliminary screening of targets and risk sources as our ROI.
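The pairing logic above can be sketched in Python. This is a minimal illustration of the IoU-based decision of Equation (1); the box format (x1, y1, x2, y2), the function names, and the threshold value are assumptions for illustration, not the authors' implementation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_worker_object_pairs(workers, ladders, eps=0.1):
    """Return the (i, j) worker-ladder index pairs whose overlap exceeds
    the threshold eps, i.e. the pairs with S_ij = 1 in Eq. (1)."""
    return [(i, j)
            for i, w in enumerate(workers)
            for j, l in enumerate(ladders)
            if iou(w, l) > eps]
```

Only the pairs passing the threshold need to be forwarded to the pose estimation stage, which keeps the monitoring cost proportional to the number of workers actually near a ladder.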

Spatial transformed human pose estimation
In order to improve the accuracy of human pose estimation under different monitoring perspectives, we utilize the two-dimensional affine transformation characteristics of the spatial transformer network to improve the ROI proposals detected by YOLO-V5. The spatial transformer module is a dynamic mechanism that actively performs a spatial transformation on an image; it can be expressed mathematically as:

$\begin{pmatrix} x_i^a \\ y_i^a \end{pmatrix} = \left[\theta_1\ \theta_2\ \theta_3\right] \begin{pmatrix} x_i^b \\ y_i^b \\ 1 \end{pmatrix}$    (2)

where {$x_i^b,y_i^b$} and {$x_i^a,y_i^a$} are the ROI coordinates before and after transformation, respectively, and [$θ_1\ θ_2\ θ_3$] is the transformation matrix.
The transformation is adopted on the entire feature map non-locally, and the output of the spatial transformer is then fed to a classical stacked hourglass module-based network for single-person pose estimation. Given the spatially transformed RGB image frame containing a person, this stacked hourglass backbone network can obtain the pixel locations of the human body skeleton. In addition, the final estimated human pose needs to be remapped back to the original image coordinates, so an inverse transformer network is adopted to compute the inverse transformation matrix [$γ_1\ γ_2\ γ_3$]. The inverse procedure of this computation can be expressed as:

$\begin{pmatrix} x_i^b \\ y_i^b \end{pmatrix} = \left[\gamma_1\ \gamma_2\ \gamma_3\right] \begin{pmatrix} x_i^a \\ y_i^a \\ 1 \end{pmatrix}$    (3)

The 17-joint 2D human pose data in the image ROI are obtained through the proposed pose estimation network. Fig. 3(a) illustrates an example image of the detection results for a person and a ladder, together with the visualization of the 2D pose estimation. The result contains the X-Y coordinate values and the predicted confidence of each indexed bone. For convenience of visualization, different colors are used to identify the bones, and keypoints with low confidence are ignored in the display. Fig. 3(b) illustrates the corresponding IDs of the human skeleton, where the numbers represent the joint indices and the prefixes L and R denote the left and right sides, respectively.
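The forward and inverse remapping described by Equations (2) and (3) can be sketched with NumPy. This is a minimal sketch assuming the common 2×3 affine-matrix layout (2×2 linear part plus translation column); the helper names are illustrative, not the authors' API:

```python
import numpy as np

def make_inverse_affine(theta):
    """Given a 2x3 affine matrix [theta1 theta2 | theta3] as in Eq. (2),
    build the 2x3 inverse [gamma1 gamma2 | gamma3] of Eq. (3) that remaps
    transformed coordinates back to the original image frame."""
    lin = theta[:, :2]                  # 2x2 linear (rotation/scale) part
    trans = theta[:, 2]                 # translation column
    lin_inv = np.linalg.inv(lin)
    return np.hstack([lin_inv, (-lin_inv @ trans).reshape(2, 1)])

def apply_affine(mat, pts):
    """Apply a 2x3 affine matrix to an (N, 2) array of 2D points."""
    pts = np.asarray(pts, dtype=float)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # homogeneous coords
    return homo @ mat.T
```

A round trip through `apply_affine(theta, ...)` followed by `apply_affine(make_inverse_affine(theta), ...)` recovers the original keypoint coordinates, which is exactly what the inverse transformer network needs to do for the estimated pose.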

Fig. 3. Visualization of pose estimation: (a) an example image and (b) the correspondence ID of human skeleton.

Phase Feature Extraction
After obtaining the 2D position information of each joint, the motion behavior of the human body can be further analyzed. In this study, we introduce a motion phase feature that can be individually associated with each bone. The motion phase is a variable representing the time within the motion cycle, which helps our system understand and evaluate human behavior by analyzing periodic movements in which different parts of the body move asynchronously. The fundamental motivation for introducing the motion phase is to describe and analyze human motion through a set of multiple, independent motion phase signals, one for each individual bone. The implementation of this motion phase feature extraction framework is described below. Fig. 4 shows the result of each step of our motion phase feature extraction.

Motion label automated detecting
In order to calculate the motion phase feature, we first have to extract the motion state of the bone, which in our work is determined by detecting whether the bone moves or not. To reduce the labor cost of manual labeling and avoid inconsistencies due to manual labeling errors, we introduce an automated data annotation module for labeling the motion state data.
As denoted in Equation (4), we can automatically obtain the motion state label s through a two-level conditional determination. A primary condition checks whether the velocity of the bone lies within a reasonable threshold range. If this condition holds true, a secondary condition on the distance moved within a time-series window is checked; a threshold for the minimum distance of bone movement is introduced to filter out inaccurate motion state data. This can be mathematically denoted as:

$s_i^k=\begin{cases}1, & v_{min}<\dfrac{\|p_i^k-p_{i-1}^k\|}{\Delta T}<v_{max}\ \ \text{and}\ \ \sum_{j=i-N+1}^{i}\|p_j^k-p_{j-1}^k\|>d_{min}\\ 0, & \text{otherwise}\end{cases}$    (4)

where $p_i^k, p_{i-1}^k$ are the positions of bone k at frame i and the previous frame, and ΔT is the time interval between two detections of the pose estimator. $v_{min}, v_{max}$ are the threshold parameters of the primary condition, adjusted according to the image resolution, and the secondary-condition threshold $d_{min}$ is adjusted according to the window of N frames. Step 1 in Fig. 4 illustrates the automated motion data annotation results, with the raw position information shown as the red curve and the labeled motion result as the blue curve.
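The two-level labeling rule of Equation (4) can be sketched as follows. This is a minimal sketch for a single bone; the threshold values, the trailing-window convention, and the function name are assumptions for illustration:

```python
import numpy as np

def label_motion(positions, dt, v_min, v_max, d_min, window):
    """Two-level motion state labeling for one bone, as in Eq. (4):
    label 1 if the per-frame velocity lies in (v_min, v_max) AND the total
    distance moved inside the trailing window exceeds d_min, else 0."""
    positions = np.asarray(positions, dtype=float)        # (T, 2) pixel coords
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    labels = np.zeros(len(positions), dtype=int)
    for i in range(1, len(positions)):
        v = steps[i - 1] / dt
        if v_min < v < v_max:                 # primary: velocity threshold
            start = max(0, i - window)
            if steps[start:i].sum() > d_min:  # secondary: distance in window
                labels[i] = 1
    return labels
```

The secondary condition suppresses isolated one-frame jitters of the pose estimator, so only sustained movement within the window is labeled as motion.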

Fig. 4. Phase extraction method example applied to a single bone.

Motion phase computing
Once the labeled motion state data are captured through the vision-based technique described above, the motion phase of each individual bone can be computed automatically under a uniform rule based on the bone movement state.
First, the original motion state function Y(t), whose value is set to 1 if the bone is moving and 0 if it is not, is normalized. We apply z-score data normalization in a window W of N frames centered at frame i:

$\hat{Y}(t_i)=\dfrac{Y(t_i)-\mu_{S_W^i}}{\sigma_{S_W^i}}$    (5)

where $μ_{S_W^i}, σ_{S_W^i}$ represent the mean value and the standard deviation within that time window, respectively. After applying this normalization, faster movements lead to larger positive values with smaller negative values around them, and vice versa. This means that the motion state remains consistent across different movement speeds and frequencies; see step 2 in Fig. 4.
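A minimal sketch of the windowed z-score of Equation (5); the handling of the centered window at the series boundaries (truncating rather than padding) is an assumption:

```python
import numpy as np

def windowed_zscore(y, window):
    """Z-score normalization of a motion state series within a window of
    `window` frames centered at each frame i, as in Eq. (5)."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    half = window // 2
    for i in range(len(y)):
        seg = y[max(0, i - half): i + half + 1]   # centered window, truncated at edges
        mu, sigma = seg.mean(), seg.std()
        out[i] = (y[i] - mu) / sigma if sigma > 0 else 0.0
    return out
```

Guarding against a zero standard deviation also handles fully stationary windows, where the raw z-score would be undefined.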
Then, a Butterworth low-pass filter is applied to the entire normalized motion state data:

$|H(\omega)|=\dfrac{1}{\sqrt{1+(\omega/w_n)^{2N}}}$    (6)

where N is the order of the Butterworth low-pass filter (here we set N = 3) and $w_n$ is a parameter related to the cut-off frequency, computed based on the Shannon-Nyquist sampling theorem. After passing through this filter, a smoother motion state curve is obtained without loss of its characteristics; see step 3 in Fig. 4.
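The filtering step can be sketched with SciPy's Butterworth implementation. The zero-phase `filtfilt` call and the default cutoff value here are assumptions, since the paper only specifies the filter order N = 3:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_motion_state(y, order=3, cutoff=0.1):
    """Zero-phase Butterworth low-pass filtering of the normalized motion
    state, as in Eq. (6). `cutoff` is the normalized cutoff frequency w_n
    in (0, 1), where 1 corresponds to the Nyquist frequency."""
    b, a = butter(order, cutoff, btype="low")
    # filtfilt runs the filter forward and backward, so the smoothed curve
    # is not phase-shifted relative to the motion state labels.
    return filtfilt(b, a, np.asarray(y, dtype=float))
```

Zero-phase filtering matters here because a phase lag in the smoothed curve would bias the sinusoidal fit of the next step.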
Thirdly, the motion phase is computed through a curve fitting process on the Butterworth-filtered data; a sinusoidal function, denoted as Equation (7), is used for parameter fitting:

$G_i(t)=a_i\sin(f_i\cdot t-s_i)+b_i$    (7)

The objective function is parameterized by $F_i=(a_i,f_i,s_i,b_i)$ (i is the frame index), and the curve fitting minimizes the following root mean square error (RMSE) loss within a window of N frames at every frame i:

$\mathcal{L}(F_i)=\sqrt{\dfrac{1}{N}\sum_{t_j\in W_i}\left(G_i(t_j)-\hat{Y}(t_j)\right)^2}$    (8)

where $ϕ_i$ is the motion phase of a given bone at frame i, obtained from the fitted parameters; it reflects, to a certain degree, which stage of the cyclical movement the bone is in, and the parameter $f_i$ indicates the frequency of bone movement. The sinusoidal fitting reconstruction results are shown in Fig. 4 (step 4).
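The fitting of Equations (7) and (8) over one window can be sketched with SciPy's least-squares solver. The initial guess and the extraction of the phase as (f·t − s) mod 2π are assumptions, since the paper does not give the exact phase formula:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_phase(t, y):
    """Fit a*sin(f*t - s) + b (Eq. 7) to a window of filtered motion state
    data by least squares (Eq. 8), and return the motion phase at the
    window's last frame together with the fitted parameters."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)

    def residual(p):
        a, f, s, b = p
        return a * np.sin(f * t - s) + b - y

    # Rough initial guess: amplitude from spread, unit frequency, mean bias.
    p0 = np.array([y.std() + 1e-3, 1.0, 0.0, y.mean()])
    a, f, s, b = least_squares(residual, p0).x
    phase = (f * t[-1] - s) % (2.0 * np.pi)
    return phase, (a, f, s, b)
```

Because the loss is non-convex in the frequency parameter, the quality of the initial guess matters; this is exactly why the paper wraps the local least-squares fit in a global search over the window range.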
However, a problem arises when a bone is stationary: the motion phase is undefined in that case. To cope with this issue, the optimized fitting amplitude parameter $a_i$ is used to generate a more general motion phase, denoted as:

$\hat{\phi}_i=\phi_i\cdot S_1(a_i)$    (9)

where $S_1(⋅)$ is a smoothstep function with a left edge $a_{min}$ and a right edge $a_{max}$. The amplitude-modulated motion phase is scaled to zero if a bone is not moving, which eliminates the aforementioned shortcoming. Moreover, a smooth transition in the blurred region between motion and stillness makes the motion phase more versatile. The phase extraction results are visualized in Fig. 4 (step 5), where the height of the blue bars represents the phase, the slope of the successive bars illustrates the frequency, and the opacity of the bars illustrates the amplitude (Fig. 4, step 6).
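The amplitude modulation of Equation (9) can be sketched as follows; the cubic Hermite form of the smoothstep and the default edge values $a_{min}$, $a_{max}$ are assumptions for illustration:

```python
import numpy as np

def smoothstep(x, left, right):
    """Classic smoothstep S1: 0 below `left`, 1 above `right`, and the
    smooth Hermite interpolation 3u^2 - 2u^3 in between."""
    u = np.clip((x - left) / (right - left), 0.0, 1.0)
    return u * u * (3.0 - 2.0 * u)

def modulate_phase(phase, amplitude, a_min=0.05, a_max=0.2):
    """Scale the raw motion phase by the fitted amplitude (Eq. 9), so the
    phase of a stationary bone fades smoothly to zero."""
    return phase * smoothstep(amplitude, a_min, a_max)
```

Below $a_{min}$ the bone is treated as still (phase forced to zero); above $a_{max}$ the fitted phase passes through unchanged; in between, the Hermite ramp avoids a hard on/off boundary between motion and stillness.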
Since our objective function is a combination of trigonometric functions with different $F_i$ parameters related to the time index i, direct optimization of Equation (8) with a fixed N may fall into a local minimum or even fail to converge. In order to minimize the loss function locally, we adopt an optimization pipeline that combines a global search for the optimal interval of the window range N with a local optimization using the least squares algorithm. The global search first applies the least squares method over the entire data and then checks the mean and standard deviation conditions while moving the window to find a suitable local optimization interval. After that, the least squares method is applied to robustly obtain convergent parameters of our objective sinusoidal function. In summary, the motion phase feature can be robustly and automatically extracted through our approach and applied to analyzing and evaluating worker behavior from skeleton motion; a case study is illustrated in the next section.

Experiment and Results

Injuries from slips and falls during ladder climbing are common in both occupational and non-occupational environments. Normal climbing, climbing with an object, backward-facing climbing, and reaching far have been identified as four ladder climbing actions, which have been extensively studied with good detection performance. In addition to these four climbing actions, however, the three-points-of-contact rule lacks research. This rule is the basic safety recommendation of the American Ladder Institute (ALI) because it minimizes the chance of slipping and falling from the ladder. Intuitively, a worker faces a high risk during climbing if he does not follow the three-points-of-contact rule. Hence, an experiment was designed and carried out to evaluate the proposed framework by analyzing such climbing behaviors. In addition, this experimental platform can be extended to safety training for climbing operations. Since the accuracy of the motion phase relies heavily on the data collection of the human skeleton, the experiment was designed and conducted in our laboratory environment to verify the performance of pose estimation and motion phase extraction, as shown in Fig. 5.

Fig. 5. Overall framework of the experiment.

In the experiment, we set the camera to face the ladder to reduce the influence of the lack of 3D geometric information, which is also in line with engineering reality. The climbing behavior data, consisting of ascent, descent, still, and working states, were collected by video cameras with a resolution of 1920×1080 pixels. By setting up cameras observing from different viewpoints, the comparative results can determine the robustness of our proposed pose estimation method against changes in perspective. The deep learning models were trained using PyTorch on an NVIDIA DGX-1 Deep Learning System equipped with a dual 20-core 2.2 GHz Intel Xeon E5-2698 v4 CPU, 8 Tesla P100 GPUs, and 512 GB of RAM.

Pose Estimation of Human Skeleton
In our human skeleton pose estimation architecture, we use YOLO as the human detector, combined with a human-object-overlapping module to locate the region of interest, which in this experiment involves the ladder. Hence, a comprehensive dataset of construction ladder images is needed for model training. A subset of ladder images annotated in COCO format was established, containing 1,180 images of construction ladders, as shown in Table 1.

Table 1. The collected dataset of various ladders, sourced from the Internet and the laboratory

Source           Images per ladder category
From laboratory  150   130     0     0
From Internet    235   235   210   220
All collected data were divided into 850 labeled images for training and 230 labeled images for evaluation, with the remaining 100 unlabeled images set aside as the testing dataset; these were then merged with the COCO dataset and a pre-trained YOLO model for transfer learning. The hyperparameters were set with a learning rate of 0.00116, a weight decay of 0.00034, and a momentum of 0.899 after an initial search of 100 generations. After training this model for 1,000 epochs with a batch size of 64, with the learning rate dropped once by a factor of 5 based on validation accuracy, we obtained a transferred YOLO model that detects humans and ladders effectively and efficiently. A human-object-overlapping strategy was then adopted to locate the human proposal related to the ladder. In the next step, for pose estimation, a stacked hourglass model was used as the backbone of the single-person pose estimator, and the spatial transformer network was adopted to eliminate the effects of changing perspectives.
For the climbing experiment, we were mainly concerned with the motion of the limbs in the vertical Y direction. The hands and feet are represented by the coordinates of the estimated wrists and ankles, respectively. The influence of the observation viewpoint on our pose estimation results is illustrated by a one-climbing-cycle example of ascent and descent, shown in Fig. 6, with pose estimation for the left wrist (LW), right wrist (RW), right ankle (RA), and left ankle (LA); the Y-axis represents the pixel value of y in image coordinates. Occlusion is well known to degrade the performance of vision-based object detection. Since people climb a ladder facing forward, they turn their backs to the observing camera, which creates challenging occlusion conditions. The results shown in Fig. 6 indicate that our method faced far more self-occlusion challenges (i.e., occlusion by the climber’s body) in detecting and locating the hands. Comparing the results from the two monitoring perspectives, view 1 encountered more self-occlusion interference, but usable results could still be obtained. In spite of this, the feet positions and the movement trend of the hands can be obtained clearly, showing that the predicted human skeleton data used to extract the motion features are uniformly stable under certain conditions. After experimental testing, the average detection time for a single frame is 25–30 ms, which is affected by the number of target objects in the image.

Fig. 6. Pose estimation example of one climbing cycle from two monitoring perspectives.

Table 2. Experiment results on the test set (pose estimation reported for the hands and feet only)

| Model | Object detection (YOLO-V5 with human-ladder overlapping) | Pose estimation (spatial transformed stacked hourglass) |
|---|---|---|
| SSD [35] | 96.1 / 95.3 / 94.6 / 92.1 / 92.2 | - |
| Fast R-CNN [36] | 99.3 / 95.8 / 96.0 / 94.5 / 95.1 | - |
| OpenPose [37] | - | 91.0 / 89.5 / 87.8 / 89.3 |
| This work | 99.1 / 96.2 / 95.7 / 94.3 / 94.6 | 90.5 / 90.7 / 89.1 / 88.9 |
This two-stage pose estimation framework was evaluated on the constructed ladder dataset for object detection and on MPII for pose estimation, with the results presented in Table 2. Comparative experiments were conducted against several classic networks, including the object detection models SSD [35] and Fast R-CNN [36] and the pose estimation model OpenPose [37]. The tests showed that our pose estimation module can detect the human and ladder objects, locate the ROI proposal, and estimate the skeleton with acceptable accuracy.

Motion Phase of Climbing Behavior
At all times during ascent, descent, and working, the climber should face the ladder and keep either two hands and one foot, or two feet and one hand, in contact with the ladder steps, rungs, and/or side rails. Improper climbing posture creates clumsiness and may cause a fall. Fig. 7(a) illustrates three point-of-contact climbing, the precaution that reduces the chance of falling. In addition, two other habitual climbing postures common in daily life are shown in Fig. 7(b) and 7(c). Several sets of experiments were carried out to examine the proposed framework: extracting the motion features of human body movement during the climbing process and applying these features to evaluate the climbing behavior. A fixed straight ladder with a height of 2.85 m was set up in the experimental area, on which experimenters simulated typical climbing behaviors for data collection. One cycle of the climbing experiment consisted of an ascent and a descent, and for each class of climbing posture, 10 cycles of experimental data were collected to accommodate potential bias and to test the robustness of identifying and evaluating the nearly indistinguishable climbing postures. Experimental tests showed that motion phase computation requires 10–15 ms per frame once the 2D skeleton data are obtained. The total execution time per frame is therefore about 35–45 ms, corresponding to an average processing speed of 25 FPS (frames per second).

Fig. 7. Examples of three types of ladder climbing action: (a) a climber maintaining three point-of-contact, (b) the posture in which the same-side hand and foot move simultaneously, denoted same-side-step climbing, and (c) the posture we call diagonal-step climbing.

A typical experimental result for one three point-of-contact climbing cycle is illustrated in Fig. 8. The visualization of the limbs' motion phase sequence clearly shows the cyclic movement right hand (RH)-right foot (RF)-left hand (LH)-left foot (LF) during ascent and, conversely, the cyclic movement LF-LH-RF-RH during descent. As mentioned above, the phase value is represented by the height of the blue bars, the slope of successive bars illustrates the frequency, and the opacity of the bars illustrates the amplitude. Each activated motion of the hands or feet can thus be represented by its motion phase, and the synthesis over the time series can be used to analyze the characteristics and standardization of the actions.
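Phase, amplitude, and frequency of one limb's vertical trajectory can be estimated in several ways; the paper extracts its motion phase feature with its own procedure, so the analytic-signal (Hilbert transform) approach below is only a stand-in sketch of the same three quantities:

```python
import numpy as np
from scipy.signal import detrend, hilbert

def motion_phase(y, fps=25.0):
    """Estimate instantaneous phase, amplitude, and frequency of one
    limb's vertical trajectory y (pixel values, one per frame).

    This uses the analytic signal as an approximation; it is not the
    paper's actual phase-extraction procedure.
    """
    y = detrend(np.asarray(y, dtype=float))   # remove the climb's net drift
    analytic = hilbert(y)
    phase = np.unwrap(np.angle(analytic))     # radians, unwrapped over time
    amplitude = np.abs(analytic)              # local swing size of the limb
    frequency = np.diff(phase) * fps / (2 * np.pi)  # cycles per second
    return phase, amplitude, frequency
```

For a limb oscillating at roughly one step per second, the recovered frequency hovers near 1 Hz away from the signal edges, where the Hilbert transform is less reliable.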

Fig. 8. Experimental results of three point-of-contact climbing.

According to the experimental data for three point-of-contact climbing, a motion phase value p_t can be obtained at each time index t, and the motion phase difference Δp of each limb indicates the order of motion. Moreover, the stable motion phase differences between the four limbs indicate that this climbing posture is a kind of four-beat rhythmic movement; the other two typical climbing postures, by contrast, follow a two-beat motion mode. Therefore, in the ladder climbing case of interest, whether a worker's climbing behavior obeys the three point-of-contact safety rule can be identified by:

$$S_r(t)=\begin{cases}1, & \min_{i\neq j}\left|p_t^{i}-p_t^{j}\right|>P_{Threshold}\\[2pt]0, & \text{otherwise}\end{cases}\qquad(10)$$

where $S_r(t)$ is the safety performance result at time $t$, $i$ and $j$ index the limbs, and $P_{Threshold}$ is a condition threshold used to identify whether two limbs move synchronously. In this ladder climbing case, if the worker performs three point-of-contact climbing, $S_r(t) = 1$; otherwise, $S_r(t) = 0$.
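The rule of Eq. (10) can be implemented as a pairwise check over the four limbs' phase values at one instant. The threshold value below is illustrative, not taken from the paper:

```python
from itertools import combinations

def safety_performance(phases, p_threshold=0.5):
    """Eq. (10) sketch: phases maps limb name -> motion phase at time t.

    Three point-of-contact climbing keeps all four limbs out of step,
    so every pairwise phase difference must exceed the threshold.
    p_threshold is an illustrative value, not from the paper.
    """
    for i, j in combinations(phases, 2):
        if abs(phases[i] - phases[j]) <= p_threshold:
            return 0  # two limbs move synchronously -> two-beat pattern
    return 1          # no synchronous pair -> three point-of-contact
```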
In the two typical climbing modes of daily life, one hand and one foot always move simultaneously; the difference between them is whether the hand and foot are on the same side. As shown in Fig. 9, same-side-step climbing alternates the paired (RH, RF) and (LH, LF), while in Fig. 10 the movement follows a similar pattern with the other combination, (RH, LF)-(LH, RF). The paired hand and foot have similar motion phases, while the phase difference between the two pairs is obvious. Hence, climbing in this two-beat rhythm can be well represented, and well distinguished, by the characteristics of its action sequence. From the motion analysis of each bone, we found that the frequencies of the limbs during the climbing process are largely consistent. As for climbing efficiency, compared with the two-beat postures, three point-of-contact climbing has the longest contact time with the ladder steps, rungs, and/or side rails, improving climbing stability and safety at the expense of efficiency. However, according to feedback from the experimenters, three point-of-contact climbing is not habitual, natural, or efficient. Since workers on construction sites have been observed breaking rules to make their work more efficient, applying this motion phase feature to help workers develop proper safety habits and awareness through training appears necessary. In ladder climbing, people follow the above three patterns, or alternate between them, during the climb; each pattern has specific sequential movement characteristics, which makes the motion phase well suited to extracting this temporal feature. Furthermore, it can be applied to distinguish irregular or inappropriate actions, evaluate the climbing habits of the observed worker, and provide feedback to improve their climbing safety behavior.
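The three patterns can be told apart by checking which limb pair, if any, moves in step. This classification sketch follows the pairing logic described above; the limb labels and threshold are illustrative assumptions:

```python
def climbing_pattern(phases, p_threshold=0.5):
    """Classify the climbing posture from phase values at one instant.

    phases: dict with keys "LH", "RH", "LF", "RF" (hands/feet).
    The pairing logic follows the patterns described in the text;
    the threshold is illustrative.
    """
    def in_step(a, b):
        return abs(phases[a] - phases[b]) <= p_threshold

    if in_step("LH", "LF") and in_step("RH", "RF"):
        return "same-side-step"       # same-side hand and foot together
    if in_step("LH", "RF") and in_step("RH", "LF"):
        return "diagonal-step"        # opposite hand and foot together
    return "three point-of-contact"   # no pair synchronous: four-beat
```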

Fig. 9. Experimental results of same-side-step climbing.

Fig. 10. Experimental results of diagonal-step climbing.

Discussion

For construction safety management, observing working behaviors has been important but difficult to implement on-site. Compared with non-vision, sensor-based methods such as UWB, RFID, and IMU, contact-free and marker-less vision technologies better fit the requirements of worker safety monitoring on construction sites. Moreover, instead of the traditional time- and labor-consuming approach of human observation, the proposed framework provides a complementary automated means of monitoring and evaluating workers' behaviors through cameras placed near the activities to be observed. The experimental results demonstrate that the proposed framework functions well for extracting a novel motion phase feature from videos and for evaluating whether workers perform their behavior according to the standard instruction or rule. With the ability to analyze the dynamic, asynchronous movements of the human body based on motion phase data computed individually for each bone, many complex working activities can be encoded for further applications. For example, the approach can serve safety training and education: for workers in high-risk jobs, a mandatory training process can be conducted by gathering motion samples over a few hours of training without significant human effort or error, then assessing and analyzing the behavior video data semi- or fully automatically. The resulting evaluation, together with the video recording, can give the trainee direct visual feedback and serve as a safety education tool for instructing them to work in a safe manner. Hence, our proposed framework opens the possibility of extending and implementing more functions and applications for camera sensors used on construction sites.
In addition, a human-object detection strategy is adopted in our proposed method, which enables the deep learning pose estimation module to locate the region of interest for safety management efficiently and effectively. This inspires further study exploring combinations of human work and risk objects. Beyond construction safety, this research is also valuable for estimating worker productivity by interpreting periodic working states across different types of construction activities from surveillance videos.
The above lists several advantages of the proposed work; the limitations identified in this study are summarized as follows. First, compared with non-vision, sensor-based methods, occlusions by other workers and objects may adversely affect visual motion sensing and the analysis results. Moreover, the experimental results were verified on climbing behavior data collected in a laboratory environment; to enable real-time safety monitoring on-site, many more types of unsafe behavior under various construction scenes need to be investigated in depth to make our framework applicable to actual construction environments.

Conclusion

To explore and provide a complementary means of detecting and evaluating unsafe behavior, a new two-stage approach has been proposed to detect construction workers who do not follow three point-of-contact climbing on an indoor experimental and safety training platform. Three algorithms were developed: (1) a YOLO-based network to detect the presence and location of the worker-climb-ladder ROI; (2) a spatially transformed CNN model to estimate 2D human pose information and skeleton data; and (3) a novel extracted motion phase feature describing the time-series progression of bone movement, used to identify whether workers maintain high safety performance while climbing the ladder. In the first stage, the average precision of object detection and pose estimation was 96.0% and 89.8%, respectively, at an average processing speed of 25 FPS. In the second stage, the dynamic movement of each bone is described by the phase, amplitude, and frequency obtained from the motion phase feature, enabling more complex motion performance analysis. The results indicate that the proposed framework can potentially be applied to automatic detection and analysis in real-time monitoring systems on construction sites, as shown by the experiments of the ladder climbing case study.
Although it cannot identify abnormal working behavior with 100% accuracy, the developed vision-based skeleton motion phase approach offers several benefits for daily safety management practice. First, this safety training platform, combined with proactive behavior-based safety management strategies, can correct irregular actions and improve workers' safety awareness during training. Second, the proposed approach can provide wide-ranging, non-intrusive, simultaneous monitoring on construction sites without disturbing workers, reducing time and labor consumption.
Hence, for the purpose of real-time monitoring of unsafe behavior on construction sites, one line of future work lies in improving the algorithm so that it can detect a wider variety of working behaviors, such as carrying heavy objects and painting walls, more accurately without losing computational efficiency. At the same time, more experiments on real sites are needed to verify the effects of image noise and human-object interaction in complex construction environments. Another direction is to feed this motion phase data into an RNN model to extract more generic features for distinguishing abnormal behaviors. Moreover, armed with the ability to analyze workers' behavior through the dynamic, asynchronous movements of the 17 joints' motion phases, it would be interesting to extend our framework to assessing and forecasting the risk of work-related musculoskeletal disorders with more professional analysis.

Acknowledgements

Not applicable.

Funding

The presented work is supported by the GDAS’ Project of Science and Technology Development (No. 2020GDASYL-20200302015) and funded by the Research Fund Program of Guangdong Key Laboratory of Modern Control Technology (No. 2017B030314165).

Author’s Contributions

Conceptualization, ZC, LW. Funding acquisition, ZC, LW, HH. Investigation and methodology, ZC, LW. Project administration, ZC. Resources, LW. Supervision, HH. Writing of the original draft, ZC, LW. Writing of the review and editing, ZC, HH. Software, ZC, LW. Validation, LW. Formal analysis, ZJ, HH. Data curation, LW, ZC. All the authors have proofread the final version.

Competing Interests

The authors declare that they have no competing interests.

Author Biography

Name : Zaili Chen
Affiliation :
1.Faculty of Engineering, China University of Geosciences (Wuhan)
2.Guangdong Key Laboratory of Modern Control Technology, Institute of Intelligent Manufacturing, GDAS
Biography :
Zaili Chen received the B.S. degree in automation and the M.S. degree in control science and engineering from Harbin Institute of Technology, and is a PhD candidate in safety science and engineering at China University of Geosciences (Wuhan). Since 2014, he has been an algorithm engineer at the Institute of Intelligent Manufacturing, Guangdong Academy of Sciences. His research interests include control safety, security management, and risk assessment based on deep learning.

Name : Li Wu
Affiliation : Faculty of Engineering, China University of Geosciences (Wuhan)
Biography :
Li Wu is currently a Professor in the Faculty of Engineering, China University of Geosciences. He is vice chairman of the second session of the Chinese Society of Geological Hazards Prevention Engineering Professional Committee. His research interests include underground construction engineering design and safety management, blasting vibration monitoring, slope stability analysis, etc.

Name : Huagang He
Affiliation : Faculty of Engineering, China University of Geosciences (Wuhan)
Biography :
Huagang He, Associate Professor, PhD. He is currently the director of the Department of Safety Engineering of China University of Geosciences (Wuhan) and the secretary-general of the China Geological Exploration Safety Association. His research interests include engineering safety risk analysis and control, BIM technology application, etc.

Name : Zeyu Jiao
Affiliation : Guangdong Key Laboratory of Modern Control Technology, Institute of Intelligent Manufacturing, GDAS
Biography :
Zeyu Jiao was born in Qianjiang, Hubei, China in 1991. He received the B.S. and Ph.D. degrees in management science and engineering from Beihang University in 2015 and 2020, respectively. Since 2020, he has been an assistant researcher with the Institute of Intelligent Manufacturing, Guangdong Academy of Sciences. His research interests include computer vision, 3D perception, and the application of deep learning.

Name : Liangsheng Wu
Affiliation : Guangdong Key Laboratory of Modern Control Technology, Institute of Intelligent Manufacturing, GDAS
Biography :
Liangsheng Wu has been a researcher of Intelligent Manufacturing Research Institute of Guangdong Academy of Sciences since 2013. His research interests include image processing and deep learning.

References

[1] H. W. Heinrich, Industrial Accident Prevention. A Scientific Approach. New York, NY: McGraw-Hill, 1941.
[2] H. Jebelli, C. R. Ahn, and T. L. Stentz, “Fall risk analysis of construction workers using inertial measurement units: validating the usefulness of the postural stability metrics in construction,” Safety Science, vol. 84, pp. 161-170, 2016.
[3] W. Fang, B. Zhong, N. Zhao, P. E. Love, H. Luo, J. Xue, and S. Xu, “A deep learning-based approach for mitigating falls from height with computer vision: convolutional neural network,” Advanced Engineering Informatics, vol. 39, pp. 170-177, 2019.
[4] A. Kaur, N. Rao, and T. Joon, “Literature review of action recognition in the wild,” 2019 [Online]. Available: https://arxiv.org/abs/1911.12249.
[5] D. Cao, Z. Chen, and L. Gao, “An improved object detection algorithm based on multi-scaled and deformable convolutional neural networks,” Human-centric Computing and Information Sciences, vol. 10, article no. 14, 2020.https://doi.org/10.1186/s13673-020-00219-9
[6] D. Holden, T. Komura, and J. Saito, “Phase-functioned neural networks for character control,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, article no. 42, 2017. https://doi.org/10.1145/3072959.3073663
[7] M. W. Khan, Y. Ali, F. De Felice, and A. Petrillo, “Occupational health and safety in construction industry in Pakistan using modified-SIRA method,” Safety Science, vol. 118, pp. 109-118, 2019.
[8] T. E. McSween, Values-Based Safety Process: Improving Your Safety Culture with Behavior-Based Safety. Hoboken, NJ: John Wiley & Sons, 2003.
[9] X. Li and H. Long, “A review of worker behavior-based safety research: current trends and future prospects,” IOP Conference Series: Earth and Environmental Science, vol. 371, no. 3, article no. 032047, 2019. https://doi.org/10.1088/1755-1315/371/3/032047
[10] A. Asadzadeh, M. Arashpour, H. Li, T. Ngo, A. Bab-Hadiashar, and A. Rashidi, “Sensor-based safety management,” Automation in Construction, vol. 113, article no. 103128, 2020. https://doi.org/10.1016/j.autcon.2020.103128
[11] B. Choi, S. Hwang, and S. Lee, “What drives construction workers' acceptance of wearable technologies in the workplace? Indoor localization and wearable health devices for occupational safety and health,” Automation in Construction, vol. 84, pp. 31-41, 2017.
[12] R. Jin, H. Zhang, D. Liu, and X. Yan, “IoT-based detecting, locating and alarming of unauthorized intrusion on construction sites,” Automation in Construction, vol. 118, article no. 103278, 2020. https://doi.org/10.1016/j.autcon.2020.103278
[13] S. Lee and D. Park, “A real-time abnormal beat detection method using a template cluster for the ECG diagnosis of IoT devices,” Human-centric Computing and Information Sciences, vol. 11, article no. 4, 2021. https://doi.org/10.22967/HCIS.2021.11.004
[14] M. A. Perrott, T. Pizzari, J. Cook, and J. A. McClelland, “Comparison of lower limb and trunk kinematics between markerless and marker-based motion capture systems,” Gait & Posture, vol. 52, pp. 57-61, 2017.
[15] Y. Yu, W. Umer, X. Yang, and M. F. Antwi-Afari, “Posture-related data collection methods for construction workers: a review,” Automation in Construction, vol. 124, article no. 103538, 2021.https://doi.org/10.1016/j.autcon.2020.103538
[16] S. Han and S. Lee, “A vision-based motion capture and recognition framework for behavior-based safety management,” Automation in Construction, vol. 35, pp. 131-141, 2013.
[17] W. Fang, P. E. Love, H. Luo, and L. Ding, “Computer vision for behaviour-based safety in construction: a review and future directions,” Advanced Engineering Informatics, vol. 43, article no. 100980, 2020.https://doi.org/10.1016/j.aei.2019.100980
[18] B. H. Guo, Y. Zou, Y. Fang, Y. M. Goh, and P. X. Zou, “Computer vision technologies for safety science and management in construction: a critical review and future research directions,” Safety Science, vol. 135, article no. 105130, 2021.https://doi.org/10.1016/j.ssci.2020.105130
[19] J. Seo, S. Han, S. Lee, and H. Kim, “Computer vision techniques for construction safety and health monitoring,” Advanced Engineering Informatics, vol. 29, no. 2, pp. 239-251, 2015.
[20] Y. Kong and Y. Fu, “Human action recognition and prediction: a survey,” 2018 [Online]. Available: https://arxiv.org/abs/1806.11230.
[21] L. L. Presti and M. La Cascia, “3D skeleton-based human action classification: a survey,” Pattern Recognition, vol. 53, pp. 130-147, 2016
[22] L. Ding, W. Fang, H. Luo, P. E. Love, B. Zhong, and X. Ouyang, “A deep hybrid learning model to detect unsafe behavior: integrating convolution neural networks and long short-term memory,” Automation in Construction, vol. 86, pp. 118-124, 2018.
[23] W. Fang, L. Ding, H. Luo, and P. E. Love, “Falls from heights: a computer vision-based approach for safety harness detection,” Automation in Construction, vol. 91, pp. 53-61, 2018.
[24] S. J. Ray and J. Teizer, “Real-time construction worker posture analysis for ergonomics training,” Advanced Engineering Informatics, vol. 26, no. 2, pp. 439-455, 2012.
[25] E. Ohn-Bar and M. Trivedi, “Joint angles similarities and HOG2 for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, 2013, pp. 465-470.
[26] P. Wang, Z. Li, Y. Hou, and W. Li, “Action recognition based on joint trajectory maps using convolutional neural networks,” in Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 2016, pp. 102-106.
[27] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View adaptive recurrent neural networks for high performance human action recognition from skeleton data,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 2136-2145.
[28] C. Li, P. Wang, S. Wang, Y. Hou, and W. Li, “Skeleton-based action recognition using LSTM and CNN,” in Proceedings of 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 2017, pp. 585-590.
[29] E. A. Nadhim, C. Hon, B. Xia, I. Stewart, and D. Fang, “Falls from height in the construction industry: a critical review of the scientific literature,” International Journal of Environmental Research and Public Health, vol. 13, no. 7, article no. 638, 2016. https://doi.org/10.3390/ijerph13070638
[30] L. S. Robson, H. Lee, B. C. Amick III, V. Landsman, P. M. Smith, and C. A. Mustard, “Preventing fall-from-height injuries in construction: effectiveness of a regulatory training standard,” Journal of Safety Research, vol. 74, pp. 271-278, 2020.
[31] M. Liu, S. Han, and S. Lee, “Tracking-based 3D human skeleton extraction from stereo video camera toward an on-site safety and ergonomic analysis,” Construction Innovation, vol. 16, no. 3, pp. 348-367, 2016.
[32] S. Han, S. Lee, and F. Pena-Mora, “Vision-based detection of unsafe actions of a construction worker: case study of ladder climbing,” Journal of Computing in Civil Engineering, vol. 27, no. 6, pp. 635-644, 2013.
[33] L. Ding, W. Fang, H. Luo, P. E. Love, B. Zhong, and X. Ouyang, “A deep hybrid learning model to detect unsafe behavior: integrating convolution neural networks and long short-term memory,” Automation in Construction, vol. 86, pp. 118-124, 2018.
[34] G. Jocher, A. Stoken, J. Brorovec, A. Chaurasia, T. Xie, C. Liu, et al., “ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations,” 2021 [Online]. Available: https://doi.org/10.5281/zenodo.4679653.
[35] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “SSD: single shot multibox detector,” in Computer Vision - ECCV 2016. Cham, Switzerland: Springer, 2016, pp. 21-37.
[36] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1440-1448.
[37] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, “OpenPose: realtime multi-person 2D pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172-186, 2019.
