Showing 1–16 of 16 results
Date | Name
Dec 2, 2024 | Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Jun 30, 2024 | Hierarchical Memory for Long Video QA
Jun 12, 2024 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Jun 30, 2025 | Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
Nov 24, 2024 | Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Aug 6, 2025 | Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
Dec 15, 2024 | Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition
Dec 6, 2025 | VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
May 20, 2025 | UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
Nov 3, 2025 | SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
Feb 6, 2026 | ChatUMM: Robust Context Tracking for Conversational Interleaved Generation
Dec 23, 2025 | DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
Jan 29, 2026 | Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation
Nov 24, 2025 | Vidi2.5: Large Multimodal Models for Video Understanding and Creation
Nov 26, 2025 | Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Apr 20, 2023 | PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image