"au:"Meg Tong"" — arXiv2 Search

Showing 1–8 of 8 results

/ Date/ Name

Mar 14, 2025  Auditing language models for hidden objectives
Sep 21, 2023  The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Jan 31, 2025  Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Oct 20, 2023  Towards Understanding Sycophancy in Language Models
Dec 9, 2023  Steering Llama 2 via Contrastive Activation Addition
Jan 10, 2024  Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Sep 1, 2023  Taken out of context: On measuring situational awareness in LLMs
Feb 24, 2025  Forecasting Rare Language Model Behaviors