Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation
Summary: arXiv:2604.06820v1
Abstract: Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though whether they faithfully track reader responses remains unclear.
Introduction
Large language models (LLMs) can produce persuasive narratives quickly and at scale. That capability is useful in many contexts, but it also raises concern that LLMs could be misused to run disinformation campaigns. Assessing that risk means understanding how human readers actually receive LLM-generated content.
The Role of LLM Judges
In recent practice, LLMs have been used as judges to evaluate the quality and impact of generated narratives, offering a low-cost alternative to direct human evaluation. Whether these judges faithfully track how human readers respond to disinformation, however, remains unclear.
Research Methodology
To investigate the alignment between LLM judges and human readers, the authors recast evaluation as a proxy-validity problem: are judge scores a valid stand-in for reader responses? They audited LLM judges against human reader responses using a dataset comprising:
- 290 aligned articles
- 2,043 paired human ratings
- Outputs from eight frontier judges
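One way to picture such an audit dataset is as a flat list of ratings that is aggregated per article before any comparison. The summary does not describe the schema or the rating scale, so the sketch below is only an illustration: the `Rating` fields, the 1-5 scale, and the helper names `mean_score_per_article` and `align` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    article_id: str  # hypothetical identifier for an aligned article
    rater: str       # a human rater ID or a judge model name
    source: str      # "human" or "judge"
    score: float     # assumed numeric scale, e.g. 1-5


def mean_score_per_article(ratings, source):
    """Average the scores from one source ("human" or "judge") per article."""
    by_article = defaultdict(list)
    for r in ratings:
        if r.source == source:
            by_article[r.article_id].append(r.score)
    return {aid: mean(scores) for aid, scores in by_article.items()}


def align(ratings):
    """Return (article_id, human_mean, judge_mean) for articles rated by both."""
    human = mean_score_per_article(ratings, "human")
    judge = mean_score_per_article(ratings, "judge")
    shared = sorted(human.keys() & judge.keys())
    return [(aid, human[aid], judge[aid]) for aid in shared]


# Toy ratings, invented for illustration only (not data from the study).
toy = [
    Rating("a1", "h1", "human", 4.0),
    Rating("a1", "h2", "human", 5.0),
    Rating("a1", "judge_A", "judge", 3.0),
    Rating("a2", "h1", "human", 2.0),
    Rating("a2", "judge_A", "judge", 1.0),
    Rating("a3", "h3", "human", 3.0),  # no judge score, so align() drops it
]
aligned = align(toy)
```

Keeping only articles that have both human and judge scores is what makes the later judge-versus-human comparisons item-level rather than aggregate.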
Key Findings
The analysis provided several key insights into the relationship between LLM judges and human evaluations:
- Persistent Gaps: There were consistent discrepancies between judge and human evaluations across the dataset.
- Harsher Judgments: LLM judges tended to rate content more harshly than human readers did.
- Weak Recovery of Rankings: The ability of LLM judges to replicate human rankings at the item level was notably weak.
- Different Signal Dependence: LLM judges placed greater emphasis on logical rigor while penalizing emotional intensity more severely than human readers.
- Internal Cohesion: Judges exhibited higher agreement with one another than with human responses, indicating a coherent evaluative group.
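Several of the findings above correspond to simple statistics over per-article scores. The sketch below is a minimal illustration, not the authors' analysis: it assumes a mean signed gap for harshness and Spearman rank correlation for ranking recovery and agreement, which the summary does not confirm. The scores and the judge names `judge_A` and `judge_B` are invented.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_gap(judge, human):
    """Mean signed difference; negative means the judge scores lower."""
    return mean(j - h for j, h in zip(judge, human))


# Invented per-article mean scores on an assumed 1-5 scale.
human = [4.0, 3.0, 5.0, 2.0, 4.5]
judges = {
    "judge_A": [2.0, 3.0, 2.5, 1.0, 1.5],
    "judge_B": [2.5, 3.5, 3.0, 1.5, 2.0],
}

gaps = {name: mean_gap(s, human) for name, s in judges.items()}
judge_human = {name: spearman(s, human) for name, s in judges.items()}
judge_judge = mean(
    spearman(judges[a], judges[b]) for a, b in combinations(sorted(judges), 2)
)
```

In this toy data both judges score below the human means, rank the articles identically to each other, and correlate only weakly with the human ranking. That is the same qualitative pattern the findings describe, though the numbers themselves carry no evidential weight.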
Implications and Conclusions
The findings suggest that LLM judges form an internally consistent group of evaluators, but their alignment with human perceptions is weak. Agreement among judges is not evidence that they are a valid proxy for reader responses. Risk assessments of LLM-generated disinformation should therefore include direct human evaluation, because relying solely on LLM judges could misrepresent how actual readers perceive the content.
In conclusion, the study emphasizes that human evaluation remains essential for understanding the risk posed by LLM-generated disinformation. As models continue to advance, continued research will be needed to keep risk assessments grounded in how readers actually respond.
