Text-Promptable Propagation for Referring Medical Image Sequence Segmentation
/ Authors
/ Abstract
Referring Medical Image Sequence Segmentation (Ref-MISS) is a novel and challenging task that aims to segment anatomical structures in medical image sequences (e.g., endoscopy, ultrasound, CT, and MRI) based on natural language descriptions. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences, and lack support for interactive, text-driven guidance. To address these limitations, we propose Text-Promptable Propagation (TPP), which enables the recognition of referred objects through cross-modal referring interaction, and maintains continuous tracking across the sequence via Transformer-based triple propagation, using text embeddings as queries. To support this task, we curate a large-scale benchmark, Ref-MISS-Bench, which covers 4 imaging modalities and 20 different organs and lesions. Experimental results on this benchmark demonstrate that TPP consistently outperforms state-of-the-art methods in both medical segmentation and referring video object segmentation. Code and data are available at https://github.com/yuanruntian/TPP.
Journal: Proceedings of the 33rd ACM International Conference on Multimedia