"au:"Songyang Zhang"" — arXiv2 Search
Showing 1–8 of 8 results
Date          Name
Nov 18, 2025  ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
Aug 25, 2025  InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Jan 24, 2025  Humanity's Last Exam
Oct 16, 2024  ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
May 20, 2024  MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Mar 26, 2024  InternLM2 Technical Report
Dec 21, 2023  T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
Oct 20, 2023  BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues