Seeing Radio: From Zero RF Priors to Explainable Modulation Recognition with Vision Language Models
/ Authors
/ Abstract
Current RF machine-learning pipelines rely on task-specific deep networks for modulation classification and related tasks, but these models require custom architectures and labeled datasets for each problem, generalize poorly across channel conditions and SNRs, and offer little interpretability. In contrast, modern multimodal large language models (MLLMs) integrate heterogeneous visual and textual data and exhibit strong cross-domain generalization and explanation capabilities. In this work, we explore whether vision-language models (VLMs) can be adapted to directly perceive RF signals and reason about modulation patterns without redesigning their architectures or injecting RF-specific inductive biases. To this end, we convert complex IQ streams into time-series, spectrogram, and joint RF visualizations, build a 57-class RF visual question answering benchmark, and show that lightweight parameter-efficient fine-tuning raises the accuracy of a general-purpose VLM from around 10% to nearly 90%, while maintaining robustness to noise and out-of-vocabulary modulations and the ability to produce human-readable rationales. These results show that combining RF-to-image conversion with promptable VLMs provides a scalable, practical foundation for RF-aware AI systems in future 6G networks.
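To make the RF-to-image step concrete, the sketch below shows one plausible way to render a complex IQ capture as the time-series and spectrogram views described above, assuming NumPy, SciPy, and Matplotlib; the function name `iq_to_images`, the 256-sample STFT window, and the synthetic QPSK-like test burst are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

def iq_to_images(iq, fs, out_prefix="sample"):
    """Render a complex IQ capture as time-series and spectrogram images.

    iq : 1-D complex array of baseband samples
    fs : sampling rate in Hz
    """
    # Time-series view: overlay the in-phase and quadrature components.
    t = np.arange(len(iq)) / fs
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(t, iq.real, label="I", linewidth=0.8)
    ax.plot(t, iq.imag, label="Q", linewidth=0.8)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Amplitude")
    ax.legend(loc="upper right")
    fig.savefig(f"{out_prefix}_time.png", dpi=150, bbox_inches="tight")
    plt.close(fig)

    # Spectrogram view: two-sided log-magnitude STFT of the complex baseband signal.
    f, tt, Sxx = spectrogram(iq, fs=fs, nperseg=256, noverlap=128,
                             return_onesided=False)
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.pcolormesh(tt, np.fft.fftshift(f),
                  10 * np.log10(np.fft.fftshift(Sxx, axes=0) + 1e-12),
                  shading="auto")
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Frequency (Hz)")
    fig.savefig(f"{out_prefix}_spec.png", dpi=150, bbox_inches="tight")
    plt.close(fig)

# Example: a noisy QPSK-like burst at 1 MS/s (synthetic placeholder data).
rng = np.random.default_rng(0)
symbols = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=512) / np.sqrt(2)
iq = np.repeat(symbols, 8) + 0.05 * (rng.standard_normal(4096)
                                     + 1j * rng.standard_normal(4096))
iq_to_images(iq.astype(np.complex64), fs=1e6)
```

The resulting PNG files can then be paired with question-answer prompts about the depicted modulation to form visual question answering examples for the VLM.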