Multi-Scale Keypoints Feature Fusion Network for 3D Object Detection from Point Clouds
• Xu Zhang1,*, Linjuan Bai1, Zuyu Zhang2, and Yan Li2

Human-centric Computing and Information Sciences volume 12, Article number: 29 (2022)
https://doi.org/10.22967/HCIS.2022.12.029

Abstract

Lidar-based object detection is used in numerous applications. To reduce the high computation cost, point-based methods employ furthest point sampling (FPS) and voxel-based methods employ feature up-sampling. In addition, aggregating all feature vectors for prediction is costly and degrades the object confidence and location estimates. We therefore propose a novel multi-scale keypoints feature fusion framework for 3D object detection, which takes advantage of both the 3D voxel convolutional neural network and PointNet-based set abstraction to learn more discriminative point features. A feature FPS set abstraction module is proposed to aggregate 3D voxel-wise features; it handles point loss and redundancy while learning complex features. A multi-scale feature fusion strategy acquires context information and location information with multiple receptive fields and reduces the computation cost. Finally, a refine prediction head is designed to improve box refinement and confidence prediction. We evaluate our model on the KITTI benchmark, where it performs well and achieves a 2.09% improvement on the hard difficulty of the car category. The feature FPS set abstraction module and fusion strategy outperform state-of-the-art work by 2.13% across all categories and difficulties.

Keywords

Point Cloud, 3D Object Detection, Multi-Scale, Keypoints, Feature Fusion, Furthest Point Sampling

Introduction

In recent years, lidar-based object detection, like RGB-D-based and lidar-RGB-based 3D object detection, has received increasing attention in applications such as autonomous driving and robotics. Lidar sensors are widely adopted in autonomous vehicles and robots to capture 3D scene information as sparse and irregular point clouds [1]. Reducing the occurrence of accidents with the assistance of science and technology has become a significant research topic in computer vision and intelligent vehicle infrastructure cooperative systems (IVICS). With the widespread popularity of 3D object detection in autonomous driving, augmented reality, and robotics, lidar sensors are increasingly used to capture rich spatial and structural information about the surrounding environment, which provides vital clues for 3D scene perception and understanding [2]. Recently, 2D object detection has made huge breakthroughs [3], and applications of object detection are expanding [4]. However, 2D approaches cannot be transferred directly to 3D object detection because point clouds are unstructured, sparse, irregular, and locality sensitive [5]. Most existing 3D detection methods can be classified into two categories according to the number of sensors: multi-sensor methods [6-11] and lidar-only methods [12-23]. Lidar-RGB-based methods, which belong to the multi-sensor category, seek an efficient way to exploit images for high-performance object detection. Lidar-only methods avoid the complicated matching between point clouds and images by using the points directly. Three mainstream approaches to processing point clouds have emerged: voxel-based methods [12-17], point-based methods [18-20], and graph-based methods [21-23]. Point-based methods take raw data as input and predict bounding boxes.
Voxel-based methods voxelize the entire point cloud and abstract the feature of each voxel with 3D or 2D convolutional neural networks (CNNs). Graph-based methods exploit spatial information by establishing relationships between vertices (points) and edges (relationships between pairs of points); a graph neural network (GNN) then learns point features for 3D detection. Each of these approaches has drawbacks. Voxel-based methods are more computationally efficient but suffer from inevitable information loss during voxelization and encounter a performance bottleneck. Point-based methods consider only the spatial distance when selecting keypoints and ignore local feature correlation. As a result, the height of the bounding box and the occlusion and truncation levels have a significant impact on recognition, because these methods lose information in the downsampling process. With furthest point sampling (FPS) based on 3D Euclidean distance, not only do ground-truth boxes with few interior points lose points [20], but the surviving sampled points are also highly redundant. The subsequent classification and prediction rely on features extracted at these points, so few effective keypoints remain, and recognition performance suffers for small objects with high occlusion and truncation levels. Feature selection algorithms are used to remove irrelevant features from a data set. Kishor and Chakraborty [24] used the fast correlation-based filter (FCBF) feature selection technique to find the best features. The grape leaf disease detection network (GLDDN) [25] utilizes dual attention mechanisms for feature evaluation, detection, and classification. X-view [26] proposed a non-egocentric multi-view detection method to overcome the coarse grid partition of multi-view approaches.
To address the above issues, we propose a novel two-stage 3D object detection network that predicts 3D bounding boxes and object categories for each instance in the point cloud scene. The multi-scale keypoints feature fusion network combines the advantages of the 3D voxel convolutional neural network and PointNet-based set abstraction to learn more discriminative point cloud features. First, we voxelize the whole point cloud and apply 3D sparse convolution for voxel-wise feature learning and proposal generation. In parallel, a small set of keypoints is selected by FPS on 3D Euclidean distance to summarize the overall 3D location information from the voxel-wise features. Second, a subset of these keypoints is selected by FPS on feature distance. The features of this subset come from the 3D sparse convolution, obtained by grouping the neighboring voxel-wise features, so the interior points of various instances are effectively preserved. Unlike previous work, we choose the number of keypoints under each distance measure during sampling so as to obtain a minimal set of representative points. We also do not concatenate the entire set of 3D sparse convolution features or other multi-perspective mapping features as before; instead, we present a fusion strategy that makes better use of different receptive fields. Finally, we conduct keypoint feature set abstraction to aggregate the features for object classification and location regression. In summary, our primary contributions are threefold:

We propose a feature FPS set abstraction module that aggregates 3D voxel-wise features from the surrounding voxel volume, so that a minimal number of representative points is selected.

A multi-scale feature fusion strategy is designed in set abstraction layers after the point cloud voxelization, which can extract accurate context information and location information.

We design a refine prediction head that lets the proposed framework use different keypoint features to improve box refinement and confidence prediction, respectively.

The remaining parts of this paper are organized as follows. In Section 2, related works of 3D object detection are reviewed. Section 3 presents our framework and experiments. Finally, Section 4 gives the conclusion of this paper.

Related Work

Existing work can be classified into two categories in terms of sensor type: multi-sensor methods and single-sensor methods. In this paper, we focus on the lidar-only method, which belongs to the single-sensor methods. Point clouds can be captured directly with the light detection and ranging (LIDAR) technique [27]. Lidar-only object detection methods can be categorized as voxel-based, point-based, and graph-based. All of them can further be grouped into two categories according to the pipeline: one-stage and two-stage pipelines for 3D object detection. The technical routes for the lidar-only method are shown in Fig. 1.

Fig. 1. Overall architecture of the StudModel approach.

Multi-Sensor based 3D Object Detection

As a consequence of sensor development, many vehicles are equipped with multiple sensors, such as lidar and cameras. Laser scanners provide accurate depth information, but their output is sparse, unordered, and locality sensitive. Cameras provide more semantic information but weak depth and location information. Combining the two sensors is perhaps an elegant way to achieve high performance. A series of works [6-11] use the image to detect 2D bounding boxes first, or project the point cloud into multiple views and extract the corresponding features. Multi-view 3D network (MV3D) [6] generates candidate boxes with a fixed height in the bird’s eye view (BEV) representation of the point cloud. Aggregate view object detection network (AVOD) [9] improves on MV3D and its fixed height by introducing image features into the candidate box generation module. Frustum-PointNet [7] generates 2D detection boxes from the given images and extends each 2D box into a frustum in the point cloud, discarding the points outside the frustum to reduce computation. Multi-task multi-sensor fusion network (MMF) [11] integrates depth map, point cloud, and image information to complete multiple tasks, including depth information detection, 2D object detection, and 3D object detection. PFF3D [28] presents a fusion method that exploits both lidar and camera data. All of these methods require the sensors to be calibrated and synchronized.

Lidar-Only based 3D Object Detection
Combining images and point clouds for object detection is bottlenecked by data alignment, and detection on a point cloud truncated to a frustum is limited by the performance of the 2D detection boxes. To avoid these limitations, three mainstream techniques are used for feature learning: voxel-based, point-based, and graph-based networks.
Voxel-based: In the 2D object detection domain, CNNs are used to extract features. The irregular format and non-uniformity of point clouds, however, make feature learning challenging. Numerous researchers subdivide the raw point cloud into equally spaced voxels, which can be processed by a 3D CNN in the same way a 2D CNN processes pixels. VoxelNet [12] was the first to use voxelization to process point clouds. SECOND [13] introduced an efficient 3D voxel process using 3D sparse convolution. Thanks to voxelization, feature learning on dense data can be generalized to sparse, uniformly sampled regions.
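The voxelization step described above can be illustrated with a minimal NumPy sketch. It assumes each point is a row (x, y, z, r) and encodes every non-empty voxel by the mean of its points, which is the encoding this paper adopts in its framework. The function name `voxelize_mean` and the parameters `voxel_size` and `pc_min` are hypothetical names chosen for illustration, not taken from any of the cited implementations.

```python
import numpy as np

def voxelize_mean(points, voxel_size, pc_min):
    """Group points (N, 4: x, y, z, r) into voxels and average the points of each voxel."""
    points = np.asarray(points, dtype=np.float64)
    # Integer voxel index of every point along x, y, z.
    idx = np.floor((points[:, :3] - pc_min) / voxel_size).astype(np.int64)
    # Unique non-empty voxels and, for each point, the row of the voxel it falls in.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(voxels))
    # Sum the (x, y, z, r) rows per voxel, then divide by the point count.
    feats = np.zeros((len(voxels), points.shape[1]))
    np.add.at(feats, inverse, points)
    feats /= counts[:, None]
    return voxels, feats
```

Only non-empty voxels are returned, which is what makes the later sparse convolution cheap compared with a dense grid.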
Point-based: Although voxelization can embed the entire scene in voxels, it causes unnecessary information loss. Rather than passing through an intermediate regular representation, PointNet [29] uses neural networks to extract features from unordered points directly. It takes raw points as input and uses a multi-layer perceptron (MLP) to map low-dimensional features to a high-dimensional feature space while keeping the network invariant under transformations. PointNet++ [30] is a variant of PointNet that proposes hierarchical aggregation of point features and uses iterative FPS to choose a subset of points covering the entire point set for further feature extraction. Many works take PointNet and PointNet++ as a backbone and design different models to achieve better performance. STD [19] proposes spherical anchors to retain location information and uses PointNet++ as the backbone to extract semantic context features. 3DSSD [20] builds on PointNet and samples keypoints on the point cloud using the feature distance.
Graph-based: A common problem of using PointNet++ as a backbone network or set abstraction module is that points are grouped and sampled repeatedly. The GNN was first proposed by Gori et al. [31]. It is a neural network that acts directly on the graph structure and is a natural generalization of CNNs to non-Euclidean space. To our knowledge, dynamic graph CNN (DGCNN) [32] is the first model to adopt a GNN for feature extraction on point clouds, covering classification and segmentation. It proposes EdgeConv, which acts on graphs dynamically computed in each layer of the network. SawNet [33] combines the ideas of PointNet and DGCNN to learn local and global features in point cloud classification and segmentation tasks. Point-GNN [22] downsamples the point cloud with voxels for graph construction: it regards each point as a vertex, connects points within a fixed radius to build the graph, applies a neural network on sets to extract features, and proposes an auto-registration mechanism to ensure invariance under transformations. Sparse voxel-graph attention network (SVGA-Net) [23] constructs a local complete graph within each spherical voxel and a global KNN graph to learn discriminative feature representations. S-AT GCN [34] proposes a spatial-attention graph convolution to form a feature enhancement layer.
Two-stage pipelines take images and point clouds, or only points, as input to generate 2D bounding boxes. The point cloud is converted to a point-wise or voxel-wise representation at the same time. Set abstraction (SA) layers are applied to the point feature representation to extract context information. A 3D region proposal network (RPN) is adopted to generate a series of rough proposal boxes. Based on those proposal boxes, a refinement module produces the final predictions in the second stage.
One-stage pipelines provide a straightforward solution to category classification and location regression, but they endure a serious imbalance between positive and negative proposal boxes. Notably, the feature propagation (FP) layers and the refinement stage of two-stage pipelines are found to consume half of the inference time. 3DSSD [20] removes the FP layers and the refinement stage and predicts bounding boxes directly. Our work differs from previous work by using the distance-furthest points to represent the differences between points in the whole scene for the two-stage model. The motivation is that distance-furthest points carry more location information, whereas feature-furthest points carry more semantic information. These two kinds of points may therefore affect the regression and the classification of objects, respectively. Effective coding of multi-scale features based on voxel operations creates favorable conditions for selecting different kinds of keypoints under multi-scale features. We believe that combining the 3D voxel CNN under different sampling strategies with PointNet-based set abstraction is the right way to classify objects and regress their boxes.

Framework

In this section, we analyze and introduce the main modules of the 3D point cloud object detection model. First, we introduce “distance FPS voxel set abstraction module” and “feature FPS voxel set abstraction module”; then we propose a box prediction module for proposal generation. Finally, we present a multi-scale keypoints feature fusion strategy and refine prediction head (Fig. 2).

Fig. 2. The overall architecture contains four main components. (1) The backbone network includes 3D voxelization, 3D sparse convolution, and the BEV projection. (2) Distance FPS voxel set abstraction module. (3) Feature FPS voxel set abstraction module. (4) Refine prediction head.

Distance FPS Voxel Set Abstraction Module
To encode voxel features, we need to obtain keypoints. Specifically, we use the distance FPS (D-FPS) method to select a keypoint sequence from the point cloud P. The spatial distance measurement is formulated as

$\text{D-Distance}(X, Y) = \lVert X - Y \rVert_2$ (1)

D-Distance represents the L2 distance between two points, where 𝑋 and 𝑌 are the coordinates and reflection intensity of two different points. D-FPS generates a small d-keypoint sequence of representative points, $dp = \{p_1, p_2, p_3, \dots, p_p\}$, which represents the distribution of position information in the point cloud space. A 3D sparse voxel CNN is frequently used as the backbone network for point cloud feature extraction because it is efficient and accurate and can generate high-quality 3D proposal boxes. The points are first divided into 𝐿 × 𝑊 × 𝐻 voxels at equal intervals. Each non-empty voxel is then represented by the mean of the 3D coordinates and reflection intensity (𝑥, 𝑦, 𝑧, 𝑟) of all its interior points, or by another statistic that yields a unique value for the voxel. In this paper, we adopt the mean value of each voxel to encode the features. CNN layers with different convolution kernels and strides perform {1×, 2×, 4×, 8×} downsampling. The positions of the sampled keypoints are mapped into the voxel feature space at the corresponding location so that each keypoint has one and only one corresponding voxel. Treating each voxel as a point, we adopt the set abstraction proposed by PointNet++ [30] to aggregate the voxel-wise features. Following PV-RCNN [1], the voxel-wise feature vectors are formulated as

(2)

where $S_i^{(lk)}$ indicates the $K^{th}$ layer feature set of the $i$-th distance-measured keypoint in voxel space. It is built from the set of voxel-wise feature vectors in the $K^{th}$ layer of the 3D voxel CNN and from the 3D coordinates for the d-keypoint sequence, which are calculated from the voxel indices and actual voxel sizes of the $K^{th}$ layer. $r^k$ is the fixed radius within which the non-empty voxels at the $K^{th}$ layer are gathered. A PointNet [29] unit is then applied as

(3)

where $df_i^{pvk}$ indicates the $K^{th}$ layer feature of the $i$-th distance-measured keypoint after set abstraction.
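The grouping and pooling described in this module can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each keypoint gathers the non-empty voxels within a fixed radius, concatenates their features with the relative positions (as in PV-RCNN [1]), passes them through one shared linear layer with ReLU standing in for the PointNet unit, and max-pools over the neighbors. The names `set_abstraction`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def set_abstraction(keypoints, voxel_xyz, voxel_feats, radius, weight, bias):
    """Aggregate voxel features around each keypoint: ball query, shared layer, max-pool."""
    keypoints = np.asarray(keypoints, dtype=np.float64)
    voxel_xyz = np.asarray(voxel_xyz, dtype=np.float64)
    voxel_feats = np.asarray(voxel_feats, dtype=np.float64)
    out = np.zeros((len(keypoints), weight.shape[1]))
    for i, p in enumerate(keypoints):
        rel = voxel_xyz - p                            # position of every voxel relative to the keypoint
        mask = np.linalg.norm(rel, axis=1) < radius    # non-empty voxels inside the radius
        if not mask.any():
            continue                                   # no neighbor: the feature stays zero
        grouped = np.concatenate([voxel_feats[mask], rel[mask]], axis=1)
        hidden = np.maximum(grouped @ weight + bias, 0.0)  # shared linear layer + ReLU
        out[i] = hidden.max(axis=0)                    # symmetric max-pool over the neighbors
    return out
```

Because max-pooling is symmetric, the output does not depend on the order in which the neighboring voxels are listed, which is the property that makes the PointNet unit suitable for unordered sets.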
Feature FPS Voxel Set Abstraction Module
The points obtained by D-FPS determine the position distribution, but they usually include a large number of background points and invalid points, because D-FPS only considers 3D location characteristics. The sampled points are scattered across the scene, which benefits object regression but not object classification: the sampled points are a tiny subset of the raw point cloud, and only a few of them fall inside ground-truth boxes, which greatly reduces the number of points available for object classification. To resolve this dilemma, we extend the “distance FPS voxel set abstraction module” to a more general feature extraction module. Specifically, the distance between points is treated as a 3D feature sequence, and we extend this 3D distance feature to a higher-dimensional feature space of dimension 𝑁 to obtain semantic information. Distance sampling can thus be seen as a special case of feature sampling: when 𝑁 equals 4 and the feature consists of 𝑥, 𝑦, 𝑧, 𝑟, the “feature FPS voxel set abstraction module” reduces to the “distance FPS voxel set abstraction module.” The spatial feature measurement is formulated as

$\text{F-Distance}(X, Y) = \lVert X - Y \rVert_2$ (4)

F-Distance represents the L2 feature distance between two distance-measured keypoints, where 𝑋 and 𝑌 are the 𝑁-dimensional features of two different distance-measured keypoints, both belonging to $dp = \{p_1, p_2, p_3, \dots, p_p\}$. After feature FPS (F-FPS), we obtain the f-keypoint sequence $fp = \{p_1, p_2, p_3, \dots, p_q\}$, which satisfies the constraint $fp \subseteq dp$. In this way, sampling points after D-FPS lets us capture different semantic information while still covering the scene. A smaller subset of keypoints is thus selected to represent the distribution of semantic information in the point cloud. The procedure is the same as in the distance FPS voxel set abstraction module, except that the keypoints are replaced by the smaller subset of feature keypoints. The voxel-wise feature vectors are formulated as

(5)

where $df_i^{pvk}$ indicates the $K^{th}$ layer feature of the $i$-th feature-measured keypoint in voxel space. It is built from the set of voxel-wise feature vectors in the $K^{th}$ layer of the 3D voxel CNN and from the 3D coordinates for the f-keypoint sequence, which are calculated from the voxel indices and actual voxel sizes of the $K^{th}$ layer. $r^k$ is the fixed radius within which the non-empty voxels at the $K^{th}$ layer are gathered. A PointNet [29] unit is then applied as

(6)

where $ff_i^{pvk}$ indicates the $K^{th}$ layer feature of the $i$-th feature-measured keypoint after set abstraction.
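Since D-FPS and F-FPS differ only in the vectors on which the L2 distance is measured, both can be illustrated with one greedy sampling loop. The sketch below is a simplified, non-optimized version under that assumption: D-FPS runs on the (x, y, z, r) rows, and F-FPS runs on the 𝑁-dimensional features of the d-keypoints only, so that $fp \subseteq dp$ holds by construction. The function names and the fixed starting index are hypothetical choices.

```python
import numpy as np

def furthest_point_sampling(vectors, num_samples, start=0):
    """Greedy FPS: repeatedly pick the row furthest (L2) from all rows picked so far."""
    vectors = np.asarray(vectors, dtype=np.float64)
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start
    # Distance from every row to its nearest already-selected row.
    min_dist = np.full(len(vectors), np.inf)
    for k in range(1, num_samples):
        last = vectors[selected[k - 1]]
        dist = np.linalg.norm(vectors - last, axis=1)
        min_dist = np.minimum(min_dist, dist)
        selected[k] = int(np.argmax(min_dist))
    return selected

def two_stage_sampling(points_xyzr, features, p, q):
    """D-FPS on (x, y, z, r) picks p d-keypoints; F-FPS on their features keeps q of them."""
    features = np.asarray(features, dtype=np.float64)
    d_idx = furthest_point_sampling(points_xyzr, p)        # D-FPS, N = 4
    f_local = furthest_point_sampling(features[d_idx], q)  # F-FPS restricted to d-keypoints
    return d_idx, d_idx[f_local]
```

With the settings reported later in the paper, `p` would be 2,048 and `q` would be 960; the loop costs O(p·n) distance evaluations, which practical implementations usually run on the GPU.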
Box Prediction Module
After the 3D sparse CNN, which is implemented by several SA layers intertwined with keypoint sampling, the 3D voxel features after 8× downsampling are projected onto the 2D BEV plane along the Z-axis. Bilinear interpolation is used to obtain dense features from the bird-view feature map. The anchor-based approach [13] is used for high-quality 3D proposal generation.
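The two operations in this step can be sketched as below. The sketch assumes that the projection along the Z-axis is done by folding the height dimension into the channel dimension, as is common in SECOND-style backbones; the text does not state the exact operation, so this is an assumption. `to_bev` and `bilinear_sample` are hypothetical helper names, and the keypoint position is assumed to be already converted to continuous pixel coordinates.

```python
import numpy as np

def to_bev(volume):
    """Fold the Z axis into channels: (C, D, H, W) -> (C * D, H, W)."""
    c, d, h, w = volume.shape
    return volume.reshape(c * d, h, w)

def bilinear_sample(bev, x, y):
    """Interpolate a (C, H, W) BEV map at a continuous position (x along W, y along H)."""
    _, h, w = bev.shape
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Blend the four surrounding cells, first along x and then along y.
    top = (1 - wx) * bev[:, y0, x0] + wx * bev[:, y0, x1]
    bottom = (1 - wx) * bev[:, y1, x0] + wx * bev[:, y1, x1]
    return (1 - wy) * top + wy * bottom
```

Interpolating at the keypoint location gives each keypoint a BEV feature even when it does not sit exactly on a grid cell center.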

Multi-Scale Keypoint Feature Fusion
The point cloud is divided into voxels to calculate the features of a certain range of points. Sparse convolution makes it possible to better capture the semantic information at different scales. After the feature FPS voxel set abstraction module, we obtain the f-keypoint sequences at different layers, defined as follows:

(7)

(8)

More specifically, $fp_{conv1}$, $fp_{conv2}$, $fp_{conv3}$, $fp_{conv4}$, and $fp_{bev}$ denote the f-keypoint sequences obtained after conv1, conv2, conv3, conv4, and BEV feature abstraction. Constraint (8) is satisfied because F-FPS is applied in different feature spaces. $ff_{fp_i}^{(i)}$ denotes the f-keypoint sequence feature of the $i$-th layer. Because the sampled points differ between layers, how to use the keypoints of different layers effectively must be considered. (i) Is the f-keypoint feature of each layer effective for the point cloud task? To observe this more clearly, we use the quotient $\frac{fp_{in\,gt}}{fp}$, where the numerator is the number of f-keypoints that come from the ground-truth and the denominator 𝑓𝑝 is the total number of f-keypoints, which equals 𝑞. The distribution in the ground-truth of the points obtained by F-FPS at each layer is illustrated in Fig. 3. The abscissa indicates the sampling points obtained with the different convolutional layers. Note that “raw” represents the original points that fall in the ground-truth after laser scanning, and “fusionfeature” represents the feature that is downsampled to 128 dimensions after concatenating the features of all layers.

Fig. 3. Different layers distribution of f-keypoint sequence in ground-truth (from left to right and top to bottom). The percentage and number of the f-keypoint falling in the ground-truth box of different objects. All the distance sampling points are 2048.

(ii) How should the f-keypoint sequences of different layers be selected? For the sampling points of different layers, applying F-FPS after 3D sparse convolution generally obeys the same principle. As illustrated in Fig. 3, for every category, the number and percentage of f-keypoints falling in the ground-truth are greater at the BEV, conv1, and conv4 layers than at the other layers. In contrast, the number of f-keypoints falling in the ground-truth after “fusionfeature” is lower than at any of the conv1, conv2, conv3, conv4, and BEV layers. Given this phenomenon, concatenating the features of every layer for object classification is an unwise choice. Does this mean that the points in the BEV, conv1, and conv4 layers contribute more to classification and regression? To answer this question, we propose four different point selection strategies and test them on the validation set, as described in the next consideration. (iii) How should the characteristics of the f-keypoint sequences of different layers be merged? An intuitive solution is to combine the features of every layer directly to support the subsequent object classification and regression, as shown in formula (9). However, this solution is inconsistent with our original intention of using the minimum number of representative points to represent the semantic information. For these reasons, we consider using the sampling points of the BEV, conv1, and conv4 layers. The four fusion strategies are defined as follows. The spatial f-keypoint fusion strategy is formulated as

(9)

The second choice of the f-keypoint fusion strategy is

(10)

The third choice of the f-keypoint fusion strategy is

(11)

Finally, a fourth option that we adopt in this paper is

(12)

The corresponding feature fusion strategy is defined as

(13)

With multi-scale keypoints feature fusion, the whole scene is summarized into two keypoint sequences with multi-scale semantic features. ROI feature abstraction [1] is applied to the two keypoint sequences for proposal refinement. The refine prediction head applies MLPs to ff and df for confidence prediction and box refinement, respectively.
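The keypoint selection strategies restricted to the BEV, conv1, and conv4 layers can be read as set operations on keypoint indices. The sketch below uses the strategy labels that appear in the result tables ("U" for union, "^" for intersection). The grouping of the mixed strategy as bev ∪ (conv1 ∩ conv4) is an assumption about operator precedence, chosen because it is consistent with the reported sample count of about 1,000; the function name `fuse_keypoints` is hypothetical.

```python
def fuse_keypoints(fp_bev, fp_conv1, fp_conv4, strategy):
    """Combine per-layer f-keypoint index sets according to a named fusion strategy."""
    bev, c1, c4 = set(fp_bev), set(fp_conv1), set(fp_conv4)
    if strategy == "bev^conv1^conv4":
        fused = bev & c1 & c4          # keypoints selected at all three layers
    elif strategy == "bevUconv1Uconv4":
        fused = bev | c1 | c4          # keypoints selected at any of the layers
    elif strategy == "bevUconv1^conv4":
        # Assumption: intersection binds tighter than union.
        fused = bev | (c1 & c4)
    else:
        raise ValueError(f"unknown fusion strategy: {strategy}")
    return sorted(fused)
```

The intersection keeps the fewest points and the union the most, which matches the ordering of the sample counts reported in the ablation study.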

Experiments
3.6.1 Datasets
The KITTI dataset, released in 2012, is a widely used dataset for evaluating computer vision algorithms in the field of autonomous driving. It covers multiple tasks such as 3D object detection and multi-object tracking and segmentation. The 3D object detection benchmark consists of 7,481 training images and 7,518 test images as well as the corresponding point clouds. The training samples are commonly divided into a train set (3,712 samples) and a validation set (3,769 samples).
3.6.2 Implementation Details
We conduct the experiments on a GTX 1080Ti GPU. The 3D sparse voxel CNN has four layers with feature dimensions 16, 32, 64, and 64, respectively. The network is trained end-to-end with a learning rate of 0.01 for 80 epochs with a batch size of 1. To count the sampling points of different layers that survive in a ground-truth box, we enlarge the box around its absolute position in the scene to 2 × 𝐿, 2 × 𝑊, and 2 × 𝐻. The reason is that we should focus not only on the points that fall inside the ground-truth box but also on the points around it, which help fine-tune the bounding box. In the refine prediction head, we use a 2-layer MLP for each of confidence prediction and box regression. Furthermore, we set the numbers of sampling points to 𝑝 = 2,048 and 𝑞 = 960 for the KITTI dataset. With these numbers, the scene is well covered with the fewest points, and better detection results are obtained on multiple object classes such as cars, pedestrians, and cyclists.
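The survival count with enlarged ground-truth boxes can be sketched as follows. This is an illustrative reading of the 2 × 𝐿, 2 × 𝑊, 2 × 𝐻 rule, assuming each box is given as a center, a size (L, W, H), and a yaw angle around the Z-axis; this box parameterization and the names `in_enlarged_box` and `gt_ratio` are assumptions for the example.

```python
import numpy as np

def in_enlarged_box(points, center, size, yaw, scale=2.0):
    """Boolean mask of points inside a box whose L, W, H are multiplied by `scale`."""
    local = np.asarray(points, dtype=np.float64)[:, :3] - np.asarray(center, dtype=np.float64)
    c, s = np.cos(-yaw), np.sin(-yaw)
    # Rotate by -yaw around Z so that the box becomes axis-aligned.
    x = c * local[:, 0] - s * local[:, 1]
    y = s * local[:, 0] + c * local[:, 1]
    half = scale * np.asarray(size, dtype=np.float64) / 2.0
    return (np.abs(x) <= half[0]) & (np.abs(y) <= half[1]) & (np.abs(local[:, 2]) <= half[2])

def gt_ratio(keypoints, boxes, scale=2.0):
    """Fraction of keypoints that fall in at least one (enlarged) ground-truth box."""
    keypoints = np.asarray(keypoints, dtype=np.float64)
    hit = np.zeros(len(keypoints), dtype=bool)
    for center, size, yaw in boxes:
        hit |= in_enlarged_box(keypoints, center, size, yaw, scale)
    return hit.sum() / len(keypoints)
```

Applied to the f-keypoints of one layer, `gt_ratio` corresponds to the quotient of f-keypoints in the ground-truth over all f-keypoints that is used to compare the layers.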
3.6.3 Results
To evaluate the performance, we submitted the experimental results to the KITTI 3D Object Detection Benchmark and the KITTI Bird’s Eye View (BEV) Object Detection Benchmark. We conducted 3D object detection experiments on cars, pedestrians, and cyclists, as indicated in Table 1, and BEV object detection experiments on cars and cyclists, as shown in Table 2. Before submitting the results to the official website, we tested the performance on the most important class, car, on the KITTI validation split using mean average precision (mAP) with R40, as shown in Table 3. The table attributes are as follows. “Modality” is the data source: “R” means images and “L” means point clouds. Our entries follow the format “Ours-number-method”, which means that the specified number of points from F-FPS sampling is evaluated with the specified fusion strategy; “U” denotes set union and “^” denotes set intersection. The mean average precision with 40 recall positions (mAP) is used to evaluate model performance on three difficulty levels: easy, moderate, and hard. To evaluate the overlap of the boxes, we use the same indicators as the official evaluation. Specifically, for cars, the required bounding box overlap on easy, moderate, and hard objects is 70%, 50%, and 50%, respectively. For pedestrians and cyclists, the required overlap on easy, moderate, and hard objects is 50%, 25%, and 25%, respectively.
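For readers unfamiliar with the R40 metric, the following sketch shows how average precision with 40 recall positions is commonly computed: the interpolated precision (the best precision achievable at a recall of at least r) is averaged over r = 1/40, 2/40, ..., 1. The input is assumed to be a precision-recall curve already obtained by matching detections to ground-truth boxes at the required overlap; that matching step is not shown, and the function name is hypothetical.

```python
import numpy as np

def average_precision_r40(recall, precision):
    """Average the interpolated precision over the 40 recall positions 1/40, ..., 40/40."""
    recall = np.asarray(recall, dtype=np.float64)
    precision = np.asarray(precision, dtype=np.float64)
    thresholds = np.arange(1, 41) / 40.0
    total = 0.0
    for t in thresholds:
        reachable = precision[recall >= t]   # operating points that reach this recall
        total += reachable.max() if reachable.size else 0.0
    return total / 40.0
```

A detector that never reaches a given recall level contributes zero precision at that position, so missed objects lower the score even if all reported boxes are correct.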

Table 1. mAP comparison of 3D object detection performance on the test set
| Method | Modality | 3D Car (%) Easy | 3D Car (%) Moderate | 3D Car (%) Hard | 3D Pedestrian (%) Easy | 3D Pedestrian (%) Moderate | 3D Pedestrian (%) Hard | 3D Cyclist (%) Easy | 3D Cyclist (%) Moderate | 3D Cyclist (%) Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| MV3D [6] | R+L | 74.97 | 63.63 | 54 | - | - | - | - | - | - |
| F-PointNet [7] | R+L | 82.19 | 69.79 | 60.59 | 50.53 | 42.15 | 38.08 | 72.27 | 56.12 | 49.01 |
| AVOD-FPN [9] | R+L | 83.07 | 71.76 | 65.73 | 50.46 | 42.27 | 39.04 | 63.76 | 50.55 | 44.93 |
| UberATG-MMF [11] | R+L | 88.4 | 77.43 | 70.22 | - | - | - | - | - | - |
| PFF3D [28] | R+L | 81.11 | 72.93 | 67.24 | 43.93 | 36.07 | 32.86 | 63.27 | 46.78 | 41.37 |
| PointPillars [15] | L | 82.58 | 74.31 | 68.99 | 51.45 | 41.92 | 38.89 | 77.1 | 58.65 | 51.92 |
| PointRCNN [18] | L | 86.96 | 75.64 | 70.7 | 47.98 | 39.37 | 36.01 | 74.96 | 58.82 | 52.53 |
| Point-GNN [22] | L | 88.33 | 79.47 | 72.29 | 51.92 | 43.77 | 40.14 | 78.6 | 63.48 | 57.08 |
| S-AT GCN [34] | L | 83.2 | 76.04 | 71.17 | 44.63 | 37.37 | 34.93 | 75.24 | 61.7 | 55.32 |
| Ours-896-bevUConv1UConv4 | L | 86.86 | 78.3 | 73.8 | 43.94 | 36.66 | 34.56 | 75.64 | 59.14 | 52.97 |
| Ours-960-bevUConv1UConv4 | L | 87.25 | 78.3 | 73.66 | 44 | 36.65 | 34.59 | 76.68 | 60.48 | 54.2 |
| Ours-960-bevUConv1^Conv4 | L | 85.25 | 78.4 | 73.75 | 46.01 | 38.05 | 35.72 | 78.08 | 61.8 | 54.89 |
The bold number indicates the best result.

Table 2. mAP comparison of BEV object detection performance on the test set
| Method | Modality | BEV Car (%) Easy | BEV Car (%) Moderate | BEV Car (%) Hard | BEV Cyclist (%) Easy | BEV Cyclist (%) Moderate | BEV Cyclist (%) Hard |
|---|---|---|---|---|---|---|---|
| MV3D [6] | R+L | 86.02 | 76.9 | 68.49 | - | - | - |
| F-PointNet [7] | R+L | 88.7 | 84 | 75.33 | 75.38 | 61.96 | 54.68 |
| AVOD-FPN [9] | R+L | 88.53 | 83.79 | 77.9 | 68.09 | 57.48 | 50.77 |
| UberATG-MMF [11] | R+L | 89.49 | 87.47 | 79.1 | - | - | - |
| PFF3D [28] | R+L | 89.61 | 85.08 | 80.42 | 72.67 | 55.71 | 49.58 |
| PointPillars [15] | L | 88.35 | 86.1 | 79.83 | 79.14 | 62.25 | 56 |
| PointRCNN [18] | L | 89.47 | 85.68 | 79.1 | 81.52 | 66.77 | 60.78 |
| Point-GNN [22] | L | 93.11 | 89.17 | 83.9 | - | - | - |
| S-AT GCN [34] | L | 90.85 | 87.68 | 84.2 | 78.53 | 66.71 | 60.19 |
| X-view [26] | L | - | - | - | 81.32 | 63.06 | 56.65 |
| Ours-896-bevUConv1UConv4 | L | 91.62 | 87.41 | 84.67 | 78.95 | 66.09 | 59.58 |
| Ours-960-bevUConv1UConv4 | L | 91.93 | 87.41 | 84.65 | 81.36 | 66.16 | 58.94 |
| Ours-960-bevUConv1^Conv4 | L | 91.85 | 87.78 | 84.82 | 80.98 | 67.1 | 60.84 |

The bold number indicates the best result.

Table 3. mAP comparison of 3D object detection performance on the validation set
| Method | Modality | Label | 3D Car (%) Easy | 3D Car (%) Moderate | 3D Car (%) Hard |
|---|---|---|---|---|---|
| MV3D [6] | R+L | two-stage | 71.29 | 62.68 | 56.56 |
| F-PointNet [7] | R+L | two-stage | 83.76 | 70.92 | 63.65 |
| AVOD-FPN [9] | R+L | two-stage | 84.41 | 74.44 | 68.65 |
| PFF3D [28] | R+L | one-stage | 88.04 | 77.6 | 76.23 |
| Voxelnet [12] | L | two-stage | 81.97 | 65.46 | 62.85 |
| PointRCNN [18] | L | two-stage | 88.88 | 78.63 | 77.38 |
| Point-GNN [22] | L | one-stage | 87.89 | 78.34 | 77.38 |
| Ours | L | two-stage | 89.1 | 79.13 | 78.46 |

The bold number indicates the best result.

We compare the model with other baselines. Tables 1 and 2 show the test set performance of the proposed model and the baseline models, and Table 3 shows the validation set performance. Our model performs better on the moderate and hard difficulties. The details are as follows. As shown in Table 1, the proposed method achieves a 1.9%–2.09% improvement with the fusion strategy at the maximum occlusion and truncation level. On the hard difficulty, the variant that uses the union of keypoint sets achieves a 2.09% improvement. Table 2 compares BEV object detection performance on the car and cyclist categories. For the car category, our method improves accuracy by 0.53%–0.74% on the hard difficulty. For the cyclist category, our method improves by 0.1%–0.49% on the moderate and hard difficulties. Table 3 compares 3D object detection performance on the validation set. Our model outperforms the state-of-the-art methods on the easy, moderate, and hard difficulties of the car category, which indicates that it is a potential solution for the moderate and hard difficulties. Pedestrians, however, remain difficult to detect correctly. The reason could be that too few sampling points are used and the pedestrian bounding box is small; in addition, pedestrian points in the scene often coincide with other semantics.
3.6.4 Ablation studies

Two sets of ablation experiments were performed to determine the fusion strategy and the number of points selected by F-FPS. The training samples are commonly divided into a train set (3,712 samples) and a validation set (3,769 samples). First, we test different feature fusion strategies for the feature furthest points 𝑓𝑝, as shown in Table 4. Then, we test the influence of the number of sampled feature furthest points 𝑓𝑝, as shown in Table 5.
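The quantity varied in the second ablation is the number of points kept by furthest point sampling. The following is a minimal sketch of greedy furthest point sampling, not the authors' implementation. It assumes plain Euclidean distance over whatever vector is passed in (coordinates, features, or both concatenated), and the function name and the `start` parameter are hypothetical.

```python
import math


def furthest_point_sampling(vectors, k, start=0):
    """Greedily pick k indices whose vectors are far apart from each other.

    `vectors` is a list of equal-length float sequences. Using Euclidean
    distance on them is an assumption of this sketch.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    selected = [start]
    # min_dist[i]: distance from vector i to its nearest selected vector.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]
    while len(selected) < k:
        # The next sample is the vector furthest from the selected set.
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, v in enumerate(vectors):
            d = math.dist(v, vectors[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy 1D example: the far outlier is picked before the nearby points.
points = [[0.0], [1.0], [2.0], [10.0]]
picked = furthest_point_sampling(points, k=3)
print(picked)  # [0, 3, 2]
```

In the ablation setting, `k` would take the tested values 1,024, 960, and 896. This sketch costs O(N·k) distance evaluations, which is adequate for illustration only.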

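Table 4. mAP about 3D object detection performance comparison on the validation set for different fusion strategies

| Fusion strategy | 3D Car (%) Easy | 3D Car (%) Moderate | 3D Car (%) Hard | 3D Pedestrian (%) Easy | 3D Pedestrian (%) Moderate | 3D Pedestrian (%) Hard | 3D Cyclist (%) Easy | 3D Cyclist (%) Moderate | 3D Cyclist (%) Hard | Sample number |
|---|---|---|---|---|---|---|---|---|---|---|
| PVRCNN [1] (without F-FPS) | 88.96 | 78.74 | 78.11 | 60.83 | 55.42 | 51.21 | 83.52 | 69.18 | 66.33 | 0 |
| bev ∩ conv1 ∩ conv4 | - | - | - | - | - | - | - | - | - | ~300 |
| bev ∪ conv1 ∪ conv4 | 88.77 | 78.87 | 78.29 | 60.33 | 53.7 | 50.35 | 87.04 | 68.46 | 64.46 | ~1300 |
| bev ∪ conv1 ∩ conv4 | 89.02 | 78.82 | 78.3 | 64.71 | 58.19 | 53.84 | 83.29 | 70.17 | 65.64 | ~1000 |
| Improvement | 0.06 | 0.07 | 0.19 | 3.89 | 2.77 | 2.64 | -0.23 | 0.99 | -0.7 | |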

As shown in Table 4, to assess the effect of the feature FPS voxel set abstraction module, we first evaluate the method without F-FPS and then compare it with several feature fusion strategies. Notably, when the intersection of the features from all layers is used, only about 300 points survive. This number of sampled points is too small to cover the scene well, so this fusion strategy is abandoned. Compared with the method without F-FPS (PV-RCNN), the proposed model improves by 2.13% across the different difficulties. The most intuitive choice for the number of feature keypoints 𝑓𝑝 is half the number of distance feature points 𝑑𝑝. We test different numbers of selected points, as shown in Table 5, with the number of F-FPS points set to 1,024, 960, and 896. Considering both the difficulty levels and the accuracy, we choose 960 sampling points for F-FPS.
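The effect of the fusion strategy on the number of surviving points can be illustrated with ordinary set operations. The sketch below treats the row labels of Table 4 as intersections and unions of keypoint index sets taken from the bev, conv1, and conv4 features. The strategy names are hypothetical labels, and the grouping of the mixed strategy is not stated in this section, so bev ∪ (conv1 ∩ conv4) is an assumption.

```python
def fuse_keypoints(bev, conv1, conv4, strategy):
    """Combine three collections of keypoint indices into one set.

    Strategy names are illustrative labels for the fused rows of Table 4.
    """
    bev, conv1, conv4 = set(bev), set(conv1), set(conv4)
    if strategy == "intersect_all":   # bev ∩ conv1 ∩ conv4
        return bev & conv1 & conv4
    if strategy == "union_all":       # bev ∪ conv1 ∪ conv4
        return bev | conv1 | conv4
    if strategy == "mixed":           # assumed grouping: bev ∪ (conv1 ∩ conv4)
        return bev | (conv1 & conv4)
    raise ValueError(f"unknown strategy: {strategy}")


# Toy index sets; real keypoint sets would hold hundreds of indices.
bev, conv1, conv4 = {0, 1, 2, 3}, {2, 3, 4, 5}, {3, 4, 6}
sizes = {
    name: len(fuse_keypoints(bev, conv1, conv4, name))
    for name in ("intersect_all", "mixed", "union_all")
}
print(sizes)  # {'intersect_all': 1, 'mixed': 5, 'union_all': 7}
```

The toy sizes follow the same ordering as the sample numbers in Table 4 (about 300, 1000, and 1300). Intersecting all layers leaves the fewest points, which is why that strategy covers the scene poorly.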

Conclusion

In this paper, we present a multi-scale keypoints feature fusion object detection model for point clouds. We introduce a feature FPS set abstraction module that aggregates 3D voxel-wise features from the surrounding voxel volume while preserving a minimal set of representative points. A novel multi-scale feature fusion strategy is then adopted to obtain more accurate semantic information. Together, these designs enable our model to achieve leading accuracy in both 3D object detection and BEV object detection on the KITTI benchmark. For the car category, our method improves by 1.9%–2.09% at the hard difficulty (maximum occlusion and truncation). For the cyclist category, it improves by 0.1%–0.49% at the moderate and hard difficulties in BEV object detection. Combining the feature FPS set abstraction module with the fusion strategy outperforms previous works by at least 2.13%. In the future, we hope to extend the model to further prediction tasks based on the feature fusion structure design.

Acknowledgements

We thank the lab members at Chongqing University of Posts and Telecommunications for providing the GPU and technical support. We thank the lab members at Inha University for their technical support. We would also like to thank the editors, reviewers, and editorial staff who participated in the publication process of this paper.

Author’s Contributions

Conceptualization, XZ, LB. Funding acquisition, XZ. Investigation and methodology, XZ, LB. Writing of the original draft, LB. Writing of the review and editing, XZ, ZZ, YL. All the authors have proofread the final version.

Funding

This work was supported in part by the Key Cooperation Project of Chongqing Municipal Education Commission (No. HZ2021008), in part by the Natural Science Foundation of Chongqing, China (No. cstc2019jscx-mbdxX0021 and cstc2014kjrc-qnrc40002), and in part by the Major Industrial Technology Research and Development Project of Chongqing High-tech Industry (No. D2018-82).

Competing Interests

The authors declare that they have no competing interests.

Author Biography

Name : Xu Zhang
Affiliation : Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications; School of Digital Technologies, Tallinn University, Tallinn, Estonia
Biography : Xu Zhang received his Ph.D. in Computer and Information Engineering at Inha University, Incheon, Korea, in 2013. He is currently an associate professor at Chongqing University of Posts and Telecommunications. His research interests include urban computing, intelligent transportation systems, and few-shot learning.

Name : Linjuan Bai
Affiliation : Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications
Biography : Linjuan Bai is currently pursuing the M.S. degree in the Department of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China. Her research interests include data mining and object detection.

Name : Zuyu Zhang
Affiliation : Department of Electrical and Computer Engineering, Inha University
Biography : Zuyu Zhang received his Bachelor’s degree in the Department of Communication and Information Engineering and his Master’s degree in the Department of Computer Science and Technology from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2014 and 2019, respectively. He is currently pursuing his Ph.D. at Inha University, South Korea. His research areas include machine learning, deep learning, and biomedical image analysis.

Name : Yan Li
Affiliation : Department of Electrical and Computer Engineering, Inha University
Biography : Yan Li received the M.S. and Ph.D. degrees in the Department of Computer Science and Engineering from Inha University, Incheon, Korea, in 2008 and 2016, respectively. She is now an assistant professor in the Department of Electrical and Computer Engineering at Inha University. Her research interests include cloud computing, crowd sensing, and big data analytics.

References

[1] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “PV-RCNN: point-voxel feature set abstraction for 3D object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 10526-10535.
[2] M. Zhang, R. Fu, Y. Guo, L. Wang, P. Wang, and H. Deng, “Cyclist detection and tracking based on multilayer laser scanner,” Human-centric Computing and Information Sciences, vol. 10, article no. 20, 2020. https://doi.org/10.1186/s13673-020-00225-x
[3] D. Cao, Z. Chen, and L. Gao, “An improved object detection algorithm based on multi-scaled and deformable convolutional neural networks,” Human-centric Computing and Information Sciences, vol. 10, article no. 14, 2020. https://doi.org/10.1186/s13673-020-00219-9
[4] K. Kim and I. Y. Jung, “Secure object detection based on deep learning,” Journal of Information Processing Systems, vol. 17, no. 3, pp. 571-585, 2021.
[5] W. Song, L. Zhang, Y. Tian, S. Fong, J. Liu, and A. Gozho, “CNN-based 3D object classification using Hough space of LiDAR point clouds,” Human-centric Computing and Information Sciences, vol. 10, article no. 19, 2020. https://doi.org/10.1186/s13673-020-00228-8
[6] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 6526-6534.
[7] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 918-927.
[8] D. Xu, D. Anguelov, and A. Jain, “PointFusion: deep sensor fusion for 3D bounding box estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 244-253.
[9] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” in Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 1-8.
[10] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “IPOD: intensive point-based object detector for point cloud,” 2018 [Online]. Available: https://arxiv.org/abs/1812.05276.
[11] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-sensor fusion for 3D object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019, pp. 7345-7353.
[12] Y. Zhou and O. Tuzel, “VoxelNet: end-to-end learning for point cloud based 3D object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 4490-4499.
[13] Y. Yan, Y. Mao, and B. Li, “SECOND: sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, article no. 3337, 2018. https://doi.org/10.3390/s18103337
[14] B. Yang, W. Luo, and R. Urtasun, “PIXOR: real-time 3D object detection from point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 7652-7660.
[15] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: fast encoders for object detection from point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019, pp. 12697-12705.
[16] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 8, pp. 2647-2664, 2020.
[17] Y. Chen, S. Liu, X. Shen, and J. Jia, “Fast point R-CNN,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 2019, pp. 9774-9783.
[18] S. Shi, X. Wang, and H. Li, “PointRCNN: 3D object proposal generation and detection from point cloud,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019, pp. 770-779.
[19] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “STD: sparse-to-dense 3D object detector for point cloud,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 2019, pp. 1951-1960.
[20] Z. Yang, Y. Sun, S. Liu, and J. Jia, “3DSSD: point-based 3D single stage object detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 11037-11045.
[21] J. Zarzar, S. Giancola, and B. Ghanem, “PointRGCN: graph convolution networks for 3D vehicles detection refinement,” 2019 [Online]. Available: https://arxiv.org/abs/1911.12236.
[22] W. Shi and R. Rajkumar, “Point-GNN: graph neural network for 3D object detection in a point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, 2020, pp. 1708-1716.
[23] Q. He, Z. Wang, H. Zeng, Y. Zeng, S. Liu, and B. Zeng, “SVGA-Net: sparse voxel-graph attention network for 3D object detection from point clouds,” 2020 [Online]. Available: https://arxiv.org/abs/2006.04043.
[24] A. Kishor and C. Chakraborty, “Early and accurate prediction of diabetics based on FCBF feature selection and SMOTE,” International Journal of System Assurance Engineering and Management, 2021. https://doi.org/10.1007/s13198-021-01174-z
[25] R. Dwivedi, S. Dey, C. Chakraborty, and S. Tiwari, “Grape disease detection network based on multi-task learning and attention features,” IEEE Sensors Journal, vol. 21, no. 16, pp. 17573-17580, 2021.
[26] L. Xie, G. Xu, D. Cai, and X. He, “X-view: non-egocentric multi-view 3D object detector,” 2021 [Online]. Available: https://arxiv.org/abs/2103.13001.
[27] J. Wei, M. Xu, and H. Xiu, “A point clouds fast thinning algorithm based on sample point spatial neighborhood,” Journal of Information Processing Systems, vol. 16, no. 3, pp. 688-698, 2020.
[28] L. H. Wen and K. H. Jo, “Fast and accurate 3D object detection for lidar-camera-based autonomous vehicles using one shared voxel-based backbone,” IEEE Access, vol. 9, pp. 22080-22089, 2021.
[29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: deep learning on point sets for 3D classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 77-85.
[30] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems, vol. 30, pp. 5099-5108, 2017.
[31] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of 2005 IEEE International Joint Conference on Neural Networks, Montreal, Canada, 2005, pp. 729-734.
[32] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” ACM Transactions on Graphics, vol. 38, no. 5, article no. 146, 2019. https://doi.org/10.1145/3326362
[33] C. Kaul, N. Pears, and S. Manandhar, “SAWNet: a spatially aware deep neural network for 3D point cloud processing,” 2019 [Online]. Available: https://arxiv.org/abs/1905.07650.
[34] L. Wang, C. Wang, X. Zhang, T. Lan, and J. Li, “S-AT GCN: spatial-attention graph convolution network based feature enhancement for 3D object detection,” 2021 [Online]. Available: https://arxiv.org/abs/2103.08439.
