CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation
Summary: arXiv:2604.10504v1
Abstract: Current large language models (LLMs), even those explicitly trained for reasoning, often struggle with ambiguous content moderation cases due to misleading “decision shortcuts” embedded in context. Inspired by cognitive psychology insights into expert moderation, we introduce CARO (Chain-of-Analogy Reasoning Optimization), a novel two-stage training framework to induce robust analogical reasoning in LLMs.
Content moderation is a demanding test of large language models (LLMs). Models deployed to interpret and moderate content across platforms frequently fail on ambiguous cases, largely because of misleading decision shortcuts embedded in the context of the content being evaluated.
To address this, the authors draw on cognitive psychology, in particular on studies of how expert moderators work. The result is CARO (Chain-of-Analogy Reasoning Optimization), a two-stage training framework intended to induce robust analogical reasoning in LLMs.
Two-Stage Training Framework
The CARO framework consists of two primary stages:
- Bootstrapping Analogical Reasoning Chains: The first stage applies retrieval-augmented generation (RAG) to moderation data to produce analogical reasoning chains, which then serve as training data for supervised fine-tuning (SFT).
- Customized Direct Preference Optimization: The second stage applies a direct preference optimization (DPO) approach customized to explicitly reinforce analogical reasoning behaviors. The trained model then generates context-specific analogical references dynamically during inference, which reduces its exposure to harmful decision shortcuts.
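The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the bag-of-words similarity stands in for whatever retriever the paper uses, the function names (`retrieve_analogies`, `build_bootstrap_prompt`, `dpo_loss`), the prompt wording, and the `beta` default are all hypothetical, and the loss shown is the standard DPO objective for a single preference pair rather than the customized variant the paper describes.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a simple stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_analogies(query, labeled_cases, k=2):
    """Stage 1 (assumed form): return the k labeled cases most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        labeled_cases,
        key=lambda case: cosine(q, bag_of_words(case["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_bootstrap_prompt(query, analogies):
    """Format retrieved cases as references for generating a reasoning chain.

    The wording is hypothetical; the generated chains would become SFT data.
    """
    lines = ["Reference cases:"]
    for i, case in enumerate(analogies, 1):
        lines.append(f"{i}. {case['text']} -> {case['label']}")
    lines.append(f"New case: {query}")
    lines.append("Reason by analogy to the reference cases, then give a label.")
    return "\n".join(lines)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Stage 2: standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained and
    under a frozen reference model. This is NOT the paper's customized variant.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

In a setup like this, the "chosen" response in each preference pair would presumably be a chain that reasons through the analogies and the "rejected" one a chain that follows a shortcut, so that minimizing the loss pushes the policy toward analogical reasoning. How the paper actually constructs its pairs is not stated in the text above.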
Performance and Results
In the reported experiments, CARO substantially outperforms state-of-the-art reasoning models, including DeepSeek R1 and QwQ, and specialized moderation models such as LLaMA Guard. It also surpasses advanced fine-tuning and retrieval-augmented methods, with an average F1 score improvement of 24.9% on challenging ambiguous moderation benchmarks.
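For readers less familiar with the metric, the sketch below shows how a binary F1 score and an average gain across benchmarks might be computed. The function names and the label string "violating" are hypothetical. The text above does not say whether the 24.9% figure is an absolute or a relative improvement; the helper here computes the absolute difference as an assumption.

```python
def f1_score(y_true, y_pred, positive="violating"):
    """Binary F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1_gain(baseline_f1, model_f1):
    """Mean per-benchmark F1 difference (absolute; an assumed reading of 'improvement')."""
    gains = [m - b for b, m in zip(baseline_f1, model_f1)]
    return sum(gains) / len(gains)
```

Each list passed to `average_f1_gain` would hold one F1 value per benchmark, in the same order for the baseline and for the model being compared.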
This improvement suggests that inducing robust analogical reasoning can strengthen LLM decision-making in content moderation, chiefly by reducing the model's reliance on misleading shortcuts.
Conclusion
CARO combines insights from cognitive psychology with a two-stage training framework to improve LLM reasoning on ambiguous moderation cases. As demand for efficient and reliable content moderation grows, frameworks like CARO may play an important role in AI-driven moderation.
