Date / Name
Oct 17, 2024 / Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Feb 3, 2024 / Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
May 25, 2023 / Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Dec 20, 2022 / Human-Guided Fair Classification for Natural Language Processing
Jul 16, 2025 / ROC-n-reroll: How verifier imperfection affects test-time scaling
Oct 10, 2021 / Algorithmic collusion: A critical review
Feb 9, 2021 / Measuring Progress in Deep Reinforcement Learning Sample Efficiency
Aug 6, 2018 / Melting Si: beyond density functional theory
Aug 4, 2020 / Forecasting AI Progress: A Research Agenda
Jun 9, 2025 / How Benchmark Prediction from Fewer Data Misses the Mark
Jul 10, 2024 / Training on the Test Task Confounds Evaluation and Emergence
Nov 9, 2023 / Challenging the Validity of Personality Tests for Large Language Models
Jul 30, 2025 / Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Jun 9, 2024 / Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback