Date          Name
Jan 12, 2026  Representations of Text and Images Align From Layer One
Feb 4, 2025   Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Nov 15, 2024  Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Oct 17, 2024  Persistent Pre-Training Poisoning of LLMs
Oct 4, 2024   Gradient-based Jailbreak Images for Multimodal Fusion Models
Sep 26, 2024  An Adversarial Perspective on Machine Unlearning for AI Safety
Jun 17, 2024  Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Jun 12, 2024  Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Apr 22, 2024  Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Apr 15, 2024  Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Nov 24, 2023  Universal Jailbreak Backdoors from Poisoned Human Feedback
Oct 3, 2022   Red-Teaming the Stable Diffusion Safety Filter