arXiv2
arXiv2 Search: au:"Sami Jawhar"
Showing 1–5 of 5 results
Mar 18, 2025
Measuring AI Ability to Complete Long Software Tasks
Mar 21, 2025
HCAST: Human-Calibrated Autonomy Software Tasks
Nov 22, 2024
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
Mar 13, 2025
DarkBench: Benchmarking Dark Patterns in Large Language Models
Jun 25, 2025
The Singapore Consensus on Global AI Safety Research Priorities