Showing 1–20 of 29 results
Date / Name
Aug 10, 2022 / Exploring Anchor-based Detection for Ego4D Natural Language Query
Oct 20, 2023 / Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
Mar 14, 2024 / UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Mar 10, 2025 / Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Dec 15, 2025 / Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Feb 10, 2026 / Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
Aug 11, 2025 / Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Apr 30, 2026 / Being-H0.7: A Latent World-Action Model from Egocentric Videos
Jul 21, 2025 / Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Mar 9, 2024 / POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
Oct 3, 2024 / From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
May 28, 2024 / Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
Oct 4, 2024 / Scaling Large Motion Models with Million-Level Human Motions
Nov 25, 2024 / VideoOrion: Tokenizing Object Dynamics in Videos
Dec 14, 2025 / Robust Motion Generation using Part-level Reliable Data from Videos
Apr 20, 2026 / Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
Jul 20, 2023 / No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Mar 9, 2024 / SPAFormer: Sequential 3D Part Assembly with Transformers
Mar 19, 2025 / EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Jun 30, 2025 / Unified Multimodal Understanding via Byte-Pair Visual Encoding