Date          Name
Nov 11, 2025  Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates
Nov 10, 2025  NiuTrans.LMT: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
Nov 9, 2025   How Well Do LLMs Understand Drug Mechanisms? A Knowledge + Reasoning Evaluation Dataset
Nov 5, 2025   PLLuM: A Family of Polish Large Language Models
Nov 4, 2025   LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Nov 1, 2025   PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks
Oct 31, 2025  Atlas-Alignment: Making Interpretability Transferable Across Language Models
Oct 31, 2025  Identifying the Periodicity of Information in Natural Language
Oct 30, 2025  Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Oct 30, 2025  ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests
Oct 29, 2025  NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
Oct 29, 2025  The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Oct 28, 2025  Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Oct 28, 2025  GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
Oct 27, 2025  Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception
Oct 27, 2025  PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
Oct 26, 2025  MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
Oct 24, 2025  The Universal Landscape of Human Reasoning
Oct 24, 2025  When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Oct 23, 2025  Why Did Apple Fall: Evaluating Curiosity in Large Language Models