SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility
/ Authors
/ Abstract
Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants of explicitly hateful statements while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT argument chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop in detection performance from the hard to the soft tier: systems that reliably flag explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer. This paper contains offensive examples, used solely for research purposes.
Venue: Proceedings of the ACM Web Conference 2026