M3CS: Multi-Target Masked Point Modeling With Learnable Codebook and Siamese Decoders
Abstract
Masked point modeling has emerged as a promising scheme for self-supervised pre-training on point clouds. Existing methods reconstruct either the masked points or related features as the pre-training objective. However, given the diversity of downstream tasks, the model needs both low- and high-level representation modeling capabilities during pre-training, enabling it to capture both geometric details and semantic contexts. To this end, M3CS is proposed to endow the model with these abilities. Specifically, taking the masked point cloud as input, M3CS introduces two decoders that reconstruct the masked representations and the masked points simultaneously. Since an extra decoder would double the parameters of the decoding process and may lead to overfitting, we propose siamese decoders that keep the number of learnable parameters unchanged. Further, we propose an online codebook that projects continuous tokens into discrete ones before the masked points are reconstructed. In this way, the decoder is compelled to operate through combinations of discrete tokens rather than by memorizing each token. Comprehensive experiments show that M3CS achieves superior performance on both classification and segmentation tasks, outperforming existing single-modality, single-scale methods.
Journal: IEEE Transactions on Circuits and Systems for Video Technology
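The two core ideas of the abstract — weight-shared ("siamese") decoders that add a second reconstruction target without adding parameters, and an online codebook that quantizes continuous tokens into discrete ones before point reconstruction — can be illustrated with a minimal NumPy sketch. This is an illustrative toy, not the paper's implementation; all sizes, the single-matrix "decoder", and the nearest-neighbour lookup are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
num_tokens, dim, codebook_size = 8, 16, 32

# One shared decoder weight matrix serves both branches ("siamese"),
# so the second reconstruction target adds no learnable parameters.
W_dec = rng.standard_normal((dim, dim))

def decode(tokens):
    # The same decoder weights are applied in both branches.
    return np.tanh(tokens @ W_dec)

masked_tokens = rng.standard_normal((num_tokens, dim))

# Branch 1: reconstruct masked feature representations directly.
feat_pred = decode(masked_tokens)

# Branch 2: project continuous tokens onto a codebook by
# nearest-neighbour lookup (a stand-in for the learnable online
# codebook), then decode the quantized tokens to predict points.
codebook = rng.standard_normal((codebook_size, dim))
dists = ((masked_tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)            # discrete token indices
point_pred = decode(codebook[codes])    # decode from discrete tokens

print(feat_pred.shape, point_pred.shape, codes.shape)
```

Because the codebook has far fewer entries than there are possible continuous tokens, the point-reconstruction branch must work through combinations of a fixed discrete vocabulary rather than memorizing each individual token, which is the regularizing effect the abstract describes.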