"au:"Ge Zhang"" — arXiv2 Search

Showing 1–20 of 24 results

Mar 27, 2026  Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Jan 9, 2026  The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Nov 14, 2025  DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
Sep 30, 2025  Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Sep 4, 2025  Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
May 29, 2025  ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
May 20, 2025  KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
Apr 10, 2025  Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Feb 20, 2025  SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Jun 21, 2024  GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models
Jun 11, 2024  McEval: Massively Multilingual Code Evaluation
Feb 19, 2024  AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Dec 26, 2023  Align on the Fly: Adapting Chatbot Behavior to Established Norms
Nov 10, 2023  Vibrational Properties of One-Dimensional Disordered Hyperuniform Atomic Chains
Oct 13, 2023  LRRU: Long-short Range Recurrent Updating Networks for Depth Completion
Oct 1, 2023  TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
Aug 30, 2023  Fragment and Integrate Network (FIN): A Novel Spatial-Temporal Modeling Based on Long Sequential Behavior for Online Food Ordering Click-Through Rate Prediction
Jul 12, 2023  SoK: Comparing Different Membership Inference Attacks with a Comprehensive Benchmark
Jun 18, 2023  MARBLE: Music Audio Representation Benchmark for Universal Evaluation
May 28, 2022  Rayleigh-Taylor instability under multi-mode perturbation: discrete Boltzmann modeling with tracers