"au:"Zhoufutu Wen"" — arXiv2 Search
Showing 1–5 of 5 results
Apr 23, 2026
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
Mar 27, 2026
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Nov 14, 2025
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
May 20, 2025
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
Feb 20, 2025
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines