Showing 1–20 of 21 results
Date          Name
Mar 12, 2025  Generative Frame Sampler for Long Video Understanding
Feb 9, 2022   Image Difference Captioning with Pre-training and Contrastive Learning
Nov 17, 2022  CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
Apr 24, 2025  TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Feb 9, 2026   TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
May 31, 2024  DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
May 15, 2023  Edit As You Wish: Video Caption Editing with Multi-grained User Control
Dec 4, 2023   TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Apr 12, 2020  YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos
Oct 8, 2024   Temporal Reasoning Transfer from Text to Video
Oct 10, 2025  Mitigating Overthinking through Reasoning Shaping
May 8, 2025   RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection
Oct 23, 2025  Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
May 6, 2026   DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
Apr 16, 2024  LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Oct 12, 2025  AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Apr 21, 2023  Rethinking Benchmarks for Cross-modal Image-text Retrieval
Apr 7, 2026   Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Jun 24, 2024  UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos
May 28, 2025  RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction