GLID: Pre-training a Generalist Encoder-Decoder Vision Model

/ Authors

Jihao Liu, Jinliang Zheng, Yu Liu, Hongsheng Li

/ Abstract

This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown success in transfer learning, they still require task-specific sub-architectures to be appended for different downstream tasks, and these sub-architectures cannot benefit from large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, the pre-training pretext task and the downstream tasks are all modeled as “query-to-answer” problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID keeps the pre-trained encoder-decoder and queries, replacing only the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.
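
The fine-tuning recipe described above (keep the encoder, decoder, and queries; swap only the top linear layer) can be illustrated with a toy sketch. This is not the authors' implementation: the class `GeneralistModel`, its method `replace_head`, the plain linear maps standing in for the encoder and decoder, and the element-wise sum standing in for cross-attention are all hypothetical simplifications chosen so the example runs with the Python standard library alone.

```python
import random


def init_linear(out_dim, in_dim, rng):
    """Random bias-free weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights, x):
    """Apply a bias-free linear map to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class GeneralistModel:
    """Toy stand-in (hypothetical) for a pre-trained encoder-decoder with queries."""

    def __init__(self, in_dim, hidden_dim, num_queries, head_dim, seed=0):
        self.rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Components that would be pre-trained and kept during fine-tuning.
        self.encoder = init_linear(hidden_dim, in_dim, self.rng)
        self.decoder = init_linear(hidden_dim, hidden_dim, self.rng)
        self.queries = [
            [self.rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
            for _ in range(num_queries)
        ]
        # Topmost linear layer: the only part swapped per task.
        self.head = init_linear(head_dim, hidden_dim, self.rng)

    def replace_head(self, out_dim):
        """Install a fresh task-specific linear head; everything else is untouched."""
        self.head = init_linear(out_dim, self.hidden_dim, self.rng)

    def forward(self, x):
        """Return one answer vector per query ("query-to-answer")."""
        features = linear(self.encoder, x)
        answers = []
        for query in self.queries:
            # Assumption: an element-wise sum replaces real cross-attention.
            mixed = [q + f for q, f in zip(query, features)]
            answers.append(linear(self.head, linear(self.decoder, mixed)))
        return answers


# "Pre-trained" model whose head predicts a 16-dimensional mask patch per query.
model = GeneralistModel(in_dim=8, hidden_dim=4, num_queries=3, head_dim=16)
encoder_before, queries_before = model.encoder, model.queries

# Fine-tuning setup for a scalar-per-query task such as depth: swap the head only.
model.replace_head(out_dim=1)
answers = model.forward([0.5] * 8)
```

After the swap, `answers` holds one length-1 vector per query, while `model.encoder` and `model.queries` are the same objects as before, which is the sense in which the pretrain-finetune architecture gap is kept small.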

Journal: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

DOI: 10.1109/CVPR52733.2024.02156