Date          Name
Mar 20, 2026  MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
Nov 20, 2025  Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Oct 30, 2025  Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Jun 5, 2025   MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
May 11, 2025  Seed1.5-VL Technical Report
Feb 13, 2025  MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
May 23, 2024  TerDiT: Ternary Diffusion Models with Transformers
Feb 8, 2024   SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Nov 13, 2023  SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Jun 15, 2023  Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Apr 28, 2023  LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Mar 9, 2023   Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Aug 6, 2022   Frozen CLIP Models are Efficient Video Learners
Nov 18, 2020  End-to-End Object Detection with Adaptive Clustering Transformer