Showing 1–17 of 17 results
Date          Name
Sep 21, 2022  Toy Models of Superposition
Aug 23, 2022  Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Oct 20, 2023  Towards Understanding Sycophancy in Language Models
Feb 15, 2023  The Capacity for Moral Self-Correction in Large Language Models
Apr 10, 2019  The Spinful Large Charge Sector of Non-Relativistic CFTs: From Phonons to Vortex Crystals
Nov 4, 2022   Measuring Progress on Scalable Oversight for Large Language Models
Jun 14, 2024  Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Jun 17, 2013  A gauge theory generalization of the fermion-doubling theorem
Apr 12, 2022  Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Dec 15, 2022  Constitutional AI: Harmlessness from AI Feedback
Oct 20, 2023  Specific versus General Principles for Constitutional AI
Oct 28, 2024  Sabotage Evaluations for Frontier Models
Jul 11, 2022  Language Models (Mostly) Know What They Know
Dec 6, 2023   Evaluating and Mitigating Discrimination in Language Model Decisions
Feb 15, 2022  Predictability and Surprise in Large Generative Models
Dec 19, 2022  Discovering Language Model Behaviors with Model-Written Evaluations
Jan 10, 2024  Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training