"au:"Meg Tong"" — arXiv2 Search
Showing 1–8 of 8 results
Mar 14, 2025
Auditing language models for hidden objectives
Sep 21, 2023
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Jan 31, 2025
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Oct 20, 2023
Towards Understanding Sycophancy in Language Models
Dec 9, 2023
Steering Llama 2 via Contrastive Activation Addition
Jan 10, 2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Sep 1, 2023
Taken out of context: On measuring situational awareness in LLMs
Feb 24, 2025
Forecasting Rare Language Model Behaviors