Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
/ Authors
Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Yongheng Zhang, Jun Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, and 4 more authors
/ Abstract
Multimodal large language models (MLLMs) perform strongly on natural images, yet their ability to understand discrete visual symbols remains unclear. We present a multi-domain benchmark spanning language, culture, mathematics, physics, and chemistry, organized into three cognitive levels: perception and recognition, combination and reasoning, and association and critical thinking. Across leading MLLMs, we observe a consistent cognitive mismatch: models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval, or procedural reasoning instead of robust visual grounding. The pattern is especially pronounced for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit diagrams, and chemical structures. These results show that symbolic understanding remains a major bottleneck for multimodal intelligence and motivate training and evaluation schemes that prioritize grounded perception in discrete semantic spaces.