TY - JOUR
T1 - Co-Fix3D: Enhancing 3D Object Detection With Collaborative Refinement
AU - Li, Wenxuan
AU - Zou, Qin
AU - Chen, Chi
AU - Du, Bo
AU - Chen, Long
AU - Zhou, Jian
AU - Yu, Hongkai
PY - 2025/1/1
Y1 - 2025/1/1
N2 - 3D object detection in driving scenarios is particularly challenging due to sensor noise, occlusions, and the inherent sparsity of LiDAR point clouds, which can leave key features missing or incomplete and thereby degrade perception performance. To address these challenges, we propose Co-Fix3D, a detection framework that integrates Local and Global Enhancement (LGE) modules to refine Bird's Eye View (BEV) features. The LGE module employs the Discrete Wavelet Transform (DWT) to refine local features at a fine scale, which helps capture frequency details and subtle variations in the environment, and incorporates an attention mechanism to enhance global feature representations across the entire scene. Moreover, we adopt multi-head LGE modules, each concentrating on targets of a different level of detection difficulty, further improving overall perception performance. On the nuScenes dataset, Co-Fix3D achieves state-of-the-art performance among competing methods, with 69.4% mAP and 73.5% NDS; on the multimodal benchmark, it achieves 72.3% mAP and 74.7% NDS.
AB - 3D object detection in driving scenarios is particularly challenging due to sensor noise, occlusions, and the inherent sparsity of LiDAR point clouds, which can leave key features missing or incomplete and thereby degrade perception performance. To address these challenges, we propose Co-Fix3D, a detection framework that integrates Local and Global Enhancement (LGE) modules to refine Bird's Eye View (BEV) features. The LGE module employs the Discrete Wavelet Transform (DWT) to refine local features at a fine scale, which helps capture frequency details and subtle variations in the environment, and incorporates an attention mechanism to enhance global feature representations across the entire scene. Moreover, we adopt multi-head LGE modules, each concentrating on targets of a different level of detection difficulty, further improving overall perception performance. On the nuScenes dataset, Co-Fix3D achieves state-of-the-art performance among competing methods, with 69.4% mAP and 73.5% NDS; on the multimodal benchmark, it achieves 72.3% mAP and 74.7% NDS.
KW - deep learning methods, sensor fusion
KW - Object detection, segmentation and categorization
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105002391273&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=105002391273&origin=inward
U2 - 10.1109/LRA.2025.3555859
DO - 10.1109/LRA.2025.3555859
M3 - Article
SN - 2377-3766
VL - 10
SP - 4970
EP - 4977
JO - IEEE Robotics and Automation Letters
JF - IEEE Robotics and Automation Letters
IS - 5
ER -