Multiple-Exit Tuning: Towards Inference-Efficient Adaptation for Vision Transformer
/ Authors
/ Abstract
Parameter-efficient transfer learning (PETL) has shown great potential in adapting vision transformers (ViTs) pre-trained on large-scale datasets to various downstream tasks. Existing studies primarily focus on minimizing the number of learnable parameters. Although these methods are storage-efficient, they allocate excessive computational resources to easy samples, leading to inefficient inference. To address this issue, we introduce an inference-efficient tuning method termed multiple-exit tuning (MET). MET integrates multiple exits into the pre-trained ViT backbone. Since the predictions of adapted ViTs are typically made by linear classifiers, each exit is equipped with a linear prediction head. During inference, easy samples exit at early exits and only sufficiently hard samples proceed to the final exit, which reduces the computational cost for easy samples. MET consists of exit-specific adapters (E-adapters) and graph regularization. The E-adapters are designed to extract suitable representations for different exits. To maintain parameter efficiency, all E-adapters share the same base down-projection and up-projection matrices. Since the performance of a linear classifier is influenced by the relationships among samples, we employ graph regularization to improve the representations fed into the classifiers at early exits. We conduct extensive experiments to evaluate the performance of MET. Experimental results show that MET has a clear advantage over state-of-the-art methods in terms of both accuracy and inference efficiency.
Journal: IEEE Transactions on Circuits and Systems for Video Technology
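The early-exit inference described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`early_exit_predict`, `feature_fn`), the confidence-threshold exit criterion, and the NumPy linear heads are all assumptions standing in for the ViT blocks, E-adapters, and the paper's actual exit rule.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(x, exits, threshold=0.9):
    """Run exits in order; stop at the first whose confidence clears the threshold.

    `exits` is a list of (feature_fn, W, b) triples: feature_fn maps the running
    representation through the backbone blocks (plus exit-specific adaptation)
    up to that exit, and (W, b) is that exit's linear prediction head.
    Returns (predicted_class, exit_index). All names are illustrative.
    """
    h = x
    for i, (feature_fn, W, b) in enumerate(exits):
        h = feature_fn(h)               # representation at this exit
        probs = softmax(W @ h + b)      # linear classifier head
        # Easy samples clear the threshold early; hard samples flow onward.
        if probs.max() >= threshold or i == len(exits) - 1:
            return int(probs.argmax()), i
```

Under this sketch, a confident (easy) input leaves at the first exit, while an ambiguous (hard) input traverses every exit and is classified by the final head.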