When Agents Persuade: Rhetoric Generation and Mitigation in LLMs
arXiv:2603.04636v2
Abstract: Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and use a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO). We find that fine-tuning significantly reduces the models' tendency to generate such content, with ORPO proving most effective.
Introduction
In recent years, large language models (LLMs) have gained prominence in various applications, from customer service to content creation. However, their deployment in open environments raises concerns regarding the potential misuse of these models to generate manipulative content. This article delves into the capabilities of LLMs to produce propaganda and the methods to mitigate such behavior.
Understanding Propaganda in LLMs
Propaganda is a form of communication aimed at influencing a community's attitude toward a cause or position. Because LLMs generate fluent text on demand, they can be leveraged to create persuasive narratives. In our study, we set out to measure the extent to which LLMs produce propaganda when tasked with propaganda objectives.
Methodology
To explore the propagandistic capabilities of LLMs, we tasked them with propaganda objectives and analyzed their outputs using two domain-specific models:
- Propaganda Classifier: This model distinguishes between propaganda and non-propaganda text.
- Rhetorical Technique Detector: This model identifies various rhetorical strategies such as:
  - Loaded language
  - Appeals to fear
  - Flag-waving
  - Name-calling
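The two-model analysis described above can be sketched in Python. This is a minimal illustration, not the study's implementation: the keyword table `TECHNIQUE_CUES` and the functions `detect_techniques`, `is_propaganda`, and `analyze` are hypothetical stand-ins for the trained classifier and detector, whose details are not given here.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical keyword cues standing in for a trained technique detector.
# These toy rules are an assumption for illustration only.
TECHNIQUE_CUES: Dict[str, List[str]] = {
    "loaded_language": ["disgraceful", "outrageous", "catastrophic"],
    "appeal_to_fear": ["threat", "danger", "destroy"],
    "flag_waving": ["our nation", "patriots", "our people"],
    "name_calling": ["traitor", "crook", "coward"],
}


def detect_techniques(text: str) -> List[str]:
    """Stand-in detector: return every technique whose cue appears in the text."""
    lowered = text.lower()
    return [
        name
        for name, cues in TECHNIQUE_CUES.items()
        if any(cue in lowered for cue in cues)
    ]


def is_propaganda(text: str) -> bool:
    """Stand-in binary classifier: flag text containing at least one cue."""
    return len(detect_techniques(text)) > 0


def analyze(
    outputs: List[str],
    classifier: Callable[[str], bool] = is_propaganda,
    detector: Callable[[str], List[str]] = detect_techniques,
) -> dict:
    """Run both models over generated texts and aggregate the results."""
    flagged = [text for text in outputs if classifier(text)]
    counts = Counter(tech for text in flagged for tech in detector(text))
    rate = len(flagged) / len(outputs) if outputs else 0.0
    return {"propaganda_rate": rate, "technique_counts": dict(counts)}
```

Calling `analyze` on a batch of generated texts returns the share flagged as propaganda and a per-technique count. In practice, the two stand-in functions would be replaced by trained models passed in through the `classifier` and `detector` parameters.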
Findings
When prompted with propaganda objectives, the LLMs produced content with propagandistic elements, and they used a variety of rhetorical techniques in doing so. This shows how these models can be exploited to sway public opinion, a significant concern in the context of misinformation and social influence.
Mitigation Strategies
To address the potential for LLMs to generate manipulative content, we explored three fine-tuning strategies:
- Supervised Fine-Tuning (SFT): This method involves refining the model based on labeled data to reduce the likelihood of generating propaganda.
- Direct Preference Optimization (DPO): This technique trains the model directly on pairs of preferred and rejected responses, without a separate reward model, to discourage propagandistic content.
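- Odds Ratio Preference Optimization (ORPO): This method adds an odds-ratio preference term to the supervised fine-tuning loss, so that both objectives are optimized in a single training stage. Our findings indicated that ORPO was the most effective strategy, significantly decreasing the model’s propensity to generate propagandistic content.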
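For reference, the per-example objectives commonly given for DPO and ORPO can be written in a few lines of Python. The sketch below follows the standard published formulations and is not necessarily the configuration used in this study. The function names, the `beta` and `lam` defaults, and the use of plain floats for sequence log-probabilities are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # illustrative default, an assumption
) -> float:
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(
    avg_logp_chosen: float,
    avg_logp_rejected: float,
    lam: float = 0.1,  # illustrative weight on the odds-ratio term
) -> float:
    """ORPO loss for one pair: SFT term on the chosen response plus a
    weighted odds-ratio term. Inputs are length-normalized log-probabilities."""
    sft_term = -avg_logp_chosen
    or_term = -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
    return sft_term + lam * or_term
```

In this setting, the "chosen" response would be the non-propagandistic one and the "rejected" response the propagandistic one. Note that DPO needs log-probabilities from a reference model, whereas ORPO uses only the model being trained.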
Conclusion
The study highlights the double-edged nature of LLMs: while they offer significant advantages in text generation, they can also be misused for propaganda purposes. Implementing effective mitigation strategies is crucial for ensuring that these models serve constructive roles in society. As we continue to explore the capabilities of LLMs, ongoing research is needed to balance innovation with ethical considerations.
