Efficient Fourier Filtering Network With Contrastive Learning for AAV-Based Unaligned Bimodal Salient Object Detection
/ Authors
/ Abstract
Autonomous aerial vehicle (AAV)-based bimodal salient object detection (BSOD) aims to segment salient objects in a scene by exploiting complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing AAV-based BSOD models limits their applicability to real-world AAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning (CL) that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss (SCAL) to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform (FFT), which captures global relevance in log-linear complexity, we propose synchronized alignment fusion (SAF), which aligns and fuses bimodal features in the channel and spatial dimensions through a hierarchical filtering mechanism. Compared to the cutting-edge BSOD model [i.e., the modality registration and object search framework (MROS)], our proposed model, AlignSal, reduces the number of parameters (Params) by 70.0%, decreases the floating-point operations (FLOPs) by 49.4%, and increases the inference speed by 152.5%. Extensive experiments on the AAV RGB-T 2400 dataset and seven bimodal dense prediction datasets demonstrate that AlignSal achieves real-time inference while surpassing 19 state-of-the-art models in accuracy and generalizability across most evaluation metrics. In addition, our ablation studies verify AlignSal's potential to boost the performance of existing aligned BSOD models on AAV-based unaligned data. The code is available at https://github.com/JoshuaLPF/AlignSal.
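The abstract's core idea, fusing bimodal features in the frequency domain via the FFT, can be illustrated with a minimal NumPy sketch. This is an illustrative toy, not the paper's actual SAF module: the function name and the cross-modal gating rule below are assumptions made for demonstration, whereas the real SAF learns hierarchical channel- and spatial-dimension filters.

```python
import numpy as np

def fourier_filter_fuse(rgb_feat, thermal_feat):
    """Toy frequency-domain bimodal fusion (illustrative, NOT the paper's SAF).

    Both inputs are (C, H, W) feature maps. Each modality is moved to the
    frequency domain with a 2-D FFT, modulated by the other modality's
    spectral magnitude (a stand-in for a learned filter), and transformed
    back to the spatial domain.
    """
    Fr = np.fft.fft2(rgb_feat, axes=(-2, -1))      # RGB spectrum
    Ft = np.fft.fft2(thermal_feat, axes=(-2, -1))  # thermal spectrum
    # Cross-modal filtering: each spectrum gates the other (hypothetical rule).
    fused_spec = Fr * np.abs(Ft) + Ft * np.abs(Fr)
    # Back to the spatial domain; imaginary residue is numerical noise.
    return np.fft.ifft2(fused_spec, axes=(-2, -1)).real

# Example usage with random feature maps of shape (C=4, H=8, W=8).
feat_rgb = np.random.rand(4, 8, 8)
feat_t = np.random.rand(4, 8, 8)
fused = fourier_filter_fuse(feat_rgb, feat_t)
print(fused.shape)  # → (4, 8, 8)
```

Because the FFT and its inverse cost O(HW log HW) per channel, such frequency-domain mixing attains global receptive fields far more cheaply than quadratic-cost attention, which is the efficiency argument the abstract makes.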
Journal: IEEE Transactions on Geoscience and Remote Sensing