Date          Name
Mar 27, 2026  Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Jan 9, 2026   The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Nov 14, 2025  DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
Sep 30, 2025  Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Sep 4, 2025   Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
May 29, 2025  ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
May 20, 2025  KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
Apr 10, 2025  Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Feb 20, 2025  SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Jun 21, 2024  GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models
Jun 11, 2024  McEval: Massively Multilingual Code Evaluation
Feb 19, 2024  AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Dec 26, 2023  Align on the Fly: Adapting Chatbot Behavior to Established Norms
Nov 10, 2023  Vibrational Properties of One-Dimensional Disordered Hyperuniform Atomic Chains
Oct 13, 2023  LRRU: Long-short Range Recurrent Updating Networks for Depth Completion
Oct 1, 2023   TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
Aug 30, 2023  Fragment and Integrate Network (FIN): A Novel Spatial-Temporal Modeling Based on Long Sequential Behavior for Online Food Ordering Click-Through Rate Prediction
Jul 12, 2023  SoK: Comparing Different Membership Inference Attacks with a Comprehensive Benchmark
Jun 18, 2023  MARBLE: Music Audio Representation Benchmark for Universal Evaluation
May 28, 2022  Rayleigh-Taylor instability under multi-mode perturbation: discrete Boltzmann modeling with tracers