UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries
/ Authors
/ Abstract
Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting where a query consists of an anchor image, with or without accompanying text, aiming to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalizable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples—from LlavaSCo for CIR and Training-Sketchy for SBIR—our method achieves competitive results across benchmarks, including CIRCO mAP@50 of 38.5 and Sketchy mAP@200 of 82.7, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types. Our dataset and code are available at https://github.com/baohl00/UNION_for_IGROT.
Journal: 2025 IEEE International Conference on Data Mining Workshops (ICDMW)