Showing 1–20 of 29 results
Date / Name
Aug 10, 2022 / Exploring Anchor-based Detection for Ego4D Natural Language Query
Oct 20, 2023 / Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
Mar 14, 2024 / UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Mar 10, 2025 / Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Dec 15, 2025 / Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Feb 10, 2026 / Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
Aug 11, 2025 / Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Apr 30, 2026 / Being-H0.7: A Latent World-Action Model from Egocentric Videos
Jul 21, 2025 / Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Mar 9, 2024 / POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
Oct 3, 2024 / From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
May 28, 2024 / Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Oct 4, 2024 / Scaling Large Motion Models with Million-Level Human Motions
Nov 25, 2024 / VideoOrion: Tokenizing Object Dynamics in Videos
Dec 14, 2025 / Robust Motion Generation using Part-level Reliable Data from Videos
Apr 20, 2026 / Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
Jul 20, 2023 / No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Mar 9, 2024 / SPAFormer: Sequential 3D Part Assembly with Transformers
Mar 19, 2025 / EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Jun 30, 2025 / Unified Multimodal Understanding via Byte-Pair Visual Encoding