Showing 1–20 of 24 results
Date          Name

Nov 6, 2023   Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Jun 17, 2024  Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Jun 14, 2022  Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO
Oct 27, 2023  Personas as a Way to Model Truthfulness in Language Models
Oct 4, 2024   Gradient-based Jailbreak Images for Multimodal Fusion Models
Nov 15, 2024  Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Jun 2, 2023   PassGPT: Password Modeling and (Guided) Generation with Large Language Models
Apr 10, 2022  "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks
Nov 24, 2023  Universal Jailbreak Backdoors from Poisoned Human Feedback
Oct 3, 2022   Red-Teaming the Stable Diffusion Safety Filter
Apr 22, 2024  Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Jun 12, 2024  Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Oct 17, 2024  Persistent Pre-Training Poisoning of LLMs
Feb 4, 2025   Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Sep 26, 2024  An Adversarial Perspective on Machine Unlearning for AI Safety
Apr 3, 2024   Attributions toward Artificial Agents in a modified Moral Turing Test
Jan 12, 2026  Representations of Text and Images Align From Layer One
Oct 8, 2025   Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Jul 27, 2023  Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Apr 15, 2024  Foundational Challenges in Assuring Alignment and Safety of Large Language Models