MiMo-Embodied: X-Embodied Foundation Model Technical Report
/ Authors
Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shu-Qin Ren, Xian-Min Meng
Yuchen Zhang, Jingyu Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Min-Gu Zhou, Yinan Zheng, Zi-Yang Yue, Shuhao Gu, Hao Tian, Yuan Shen, Jianwei Cui, Wen Zhang, Shaoqing Xu, Bing-Song Wang, Haiyang Sun, Zeyu Zhu, Yun Jiang, Zibin Guo, C. Gong, Chao-Yue Zhang, Wenbo Ding, Kun Ma, Guang Chen, R. Cai, Diyun Xiang, Hengxu Qu, Fuli Luo, Hangjun Ye, Long Chen
/ Abstract
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate Autonomous Driving and Embodied AI and achieve state-of-the-art performance in both domains. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction, and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that, through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
Journal: arXiv
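Since the report points readers to released code and models, a minimal inference sketch may help orient first-time users. It assumes the published checkpoint exposes a standard Hugging Face vision-language interface; the model ID, chat-message format, and image path below are illustrative placeholders rather than details confirmed by the report, so consult the GitHub repository for the actual loading instructions.

```python
# Hypothetical usage sketch, assuming a standard Hugging Face VLM interface.
# Model ID, chat format, and file name are placeholders, not confirmed details.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "XiaomiMiMo/MiMo-Embodied"  # placeholder; see the GitHub repo for the real checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Example query spanning the two domains the report targets:
# driving-scene perception plus a short planning instruction.
image = Image.open("driving_scene.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the hazards ahead and plan the next maneuver."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```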