Showing 1–20 of 21 results
Date          Name
Aug 19, 2025  TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Jun 1, 2025   FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
Feb 18, 2024  ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Jul 8, 2025   MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
Apr 29, 2024  MileBench: Benchmarking MLLMs in Long Context
Jun 27, 2024  HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Feb 16, 2024  Humans or LLMs as the Judge? A Study on Judgement Biases
Nov 16, 2023  HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs
Dec 16, 2024  BlenderLLM: Training Large Language Models for Computer-Aided Design with Self-improvement
Aug 20, 2024  Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Nov 6, 2024   Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM
Apr 1, 2026   Do Phone-Use Agents Respect Your Privacy?
Dec 17, 2023  Silkie: Preference Distillation for Large Visual Language Models
Sep 17, 2024  Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
Sep 4, 2024   LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture
Oct 12, 2024  VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Nov 23, 2023  MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Jun 22, 2025  ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Aug 22, 2025  MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Feb 20, 2026  From Lossy to Verified: A Provenance-Aware Tiered Memory for Agents