"au:"Shauna Kravec"" — arXiv2 Search

Showing 1–17 of 17 results

Sep 21, 2022: Toy Models of Superposition
Aug 23, 2022: Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Oct 20, 2023: Towards Understanding Sycophancy in Language Models
Feb 15, 2023: The Capacity for Moral Self-Correction in Large Language Models
Apr 10, 2019: The Spinful Large Charge Sector of Non-Relativistic CFTs: From Phonons to Vortex Crystals
Nov 4, 2022: Measuring Progress on Scalable Oversight for Large Language Models
Jun 14, 2024: Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Jun 17, 2013: A gauge theory generalization of the fermion-doubling theorem
Apr 12, 2022: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Dec 15, 2022: Constitutional AI: Harmlessness from AI Feedback
Oct 20, 2023: Specific versus General Principles for Constitutional AI
Oct 28, 2024: Sabotage Evaluations for Frontier Models
Jul 11, 2022: Language Models (Mostly) Know What They Know
Dec 6, 2023: Evaluating and Mitigating Discrimination in Language Model Decisions
Feb 15, 2022: Predictability and Surprise in Large Generative Models
Dec 19, 2022: Discovering Language Model Behaviors with Model-Written Evaluations
Jan 10, 2024: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training