Date          Name
Jan 12, 2026  Representations of Text and Images Align From Layer One
Oct 8, 2025   Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Feb 4, 2025   Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Nov 15, 2024  Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Oct 17, 2024  Persistent Pre-Training Poisoning of LLMs
Oct 4, 2024   Gradient-based Jailbreak Images for Multimodal Fusion Models
Sep 26, 2024  An Adversarial Perspective on Machine Unlearning for AI Safety
Jun 17, 2024  Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Jun 12, 2024  Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Apr 22, 2024  Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Apr 15, 2024  Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Apr 3, 2024   Attributions toward Artificial Agents in a modified Moral Turing Test
Nov 24, 2023  Universal Jailbreak Backdoors from Poisoned Human Feedback
Nov 6, 2023   Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Oct 27, 2023  Personas as a Way to Model Truthfulness in Language Models
Jul 27, 2023  Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Jun 2, 2023   PassGPT: Password Modeling and (Guided) Generation with Large Language Models
Oct 3, 2022   Red-Teaming the Stable Diffusion Safety Filter
Jun 14, 2022  Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO
Apr 10, 2022  "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks