Do LLMs Misjudge Entertainment News Credibility?

Are LLMs More Skeptical of Entertainment News?

Large language models (LLMs) are increasingly used for automated news credibility assessment. That raises a question: do these models apply consistent standards across journalistic genres, particularly hard news and entertainment news? A new study posted on arXiv (arXiv:2605.01727v1) investigates whether zero-shot LLMs misclassify genuine entertainment news as fake more often than they misclassify legitimate hard news.

Research Overview

The study uses a within-dataset design built on GossipCop from FakeNewsNet, a widely used fake-news data repository. The researchers evaluated four frontier models (DeepSeek-V3.2, GPT-5.2, Claude Opus 4.6, and Gemini 3 Flash) and compared their false-positive rates on entertainment versus hard news.

Key Findings

Model-Specific Genre Asymmetry: DeepSeek-V3.2 and GPT-5.2 show statistically significant gaps in false-positive rates between entertainment and hard news: 10.1 percentage points for DeepSeek-V3.2 and 8.8 percentage points for GPT-5.2 (both p < .001).
No Comparable Difference: Claude Opus 4.6 and Gemini 3 Flash showed no comparable gap, indicating that the degree of skepticism toward entertainment news varies by model.
Style-Swap Experiment Insights: A style-swap experiment produced only limited and inconsistent changes in the models’ classifications, suggesting that the genre asymmetry is not attributable to stylistic differences alone.
Prompt-Based Mitigation: The study also tested whether prompt adjustments can reduce false positives. Framing DeepSeek-V3.2 as an entertainment-news fact-checker cut its false positives by approximately 50% without compromising recall, but the same approach yielded minimal improvement for GPT-5.2.
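
For readers who want to see how a false-positive gap like the ones above might be checked for significance, here is a minimal sketch using a two-proportion z-test. This is an assumption for illustration: this summary does not state which test the study used, and the counts below are hypothetical, not the paper's data. The function name `two_proportion_z_test` is also made up for this example.

```python
import math

def two_proportion_z_test(fp_a, n_a, fp_b, n_b):
    """Two-sided z-test for a difference between two false-positive rates.

    fp_a / n_a: genuine articles in genre A flagged as fake / all genuine articles in A.
    fp_b / n_b: the same counts for genre B.
    Returns (gap in percentage points, z statistic, two-sided p-value).
    """
    rate_a = fp_a / n_a
    rate_b = fp_b / n_b
    # Pooled rate under the null hypothesis that both genres share one rate.
    pooled = (fp_a + fp_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    gap_pp = (rate_a - rate_b) * 100
    return gap_pp, z, p_value

# Hypothetical counts (NOT from the study): 1,000 genuine articles per genre.
gap_pp, z, p = two_proportion_z_test(fp_a=180, n_a=1000, fp_b=80, n_b=1000)
print(f"gap = {gap_pp:.1f} pp, z = {z:.2f}, p = {p:.2g}")
```

Note that the gap is reported in percentage points (a difference of rates), which is not the same thing as a relative change such as the roughly 50% reduction mentioned for the prompt-based mitigation.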

Qualitative Insights

Beyond quantitative analysis, exploratory qualitative coding of the false positives revealed two recurring error patterns:

Treating Private-Life Claims as Inherently Unverifiable: The models often questioned claims about the private lives of entertainment figures, treating them as unverifiable.
Discounting Entertainment Journalism: The models tended to treat entertainment journalism as an epistemically weaker genre, which biased their assessments.

Implications for Future Assessments

These findings bear on how LLM credibility classifiers are evaluated. The study argues that aggregate performance metrics can mask structured false positives in legitimate journalism. It recommends that credibility assessments report genre-stratified false-positive rates alongside overall accuracy, so that uneven treatment of different types of news is visible.
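
To make the idea of genre-stratified false-positive analysis concrete, here is a small sketch that reports overall accuracy next to a per-genre false-positive rate. The record layout, the `"real"`/`"fake"` label strings, the function name `genre_stratified_report`, and the toy data are all assumptions for illustration; none of them come from the study.

```python
from collections import defaultdict

def genre_stratified_report(records):
    """Summarize predictions as overall accuracy plus false-positive rate per genre.

    records: iterable of (genre, true_label, predicted_label) tuples,
             with labels "real" or "fake" (a hypothetical layout).
    A false positive here is a genuine ("real") article predicted as "fake".
    """
    total = 0
    correct = 0
    real_counts = defaultdict(int)      # genuine articles seen per genre
    false_positives = defaultdict(int)  # genuine articles flagged as fake per genre
    for genre, truth, pred in records:
        total += 1
        correct += int(truth == pred)
        if truth == "real":
            real_counts[genre] += 1
            if pred == "fake":
                false_positives[genre] += 1
    if total == 0:
        raise ValueError("no records to summarize")
    fpr_by_genre = {g: false_positives[g] / n for g, n in real_counts.items()}
    return {"accuracy": correct / total, "fpr_by_genre": fpr_by_genre}

# Toy data (hypothetical): the same classifier looks fine in aggregate
# but flags genuine entertainment stories three times as often as hard news.
records = (
    [("entertainment", "real", "fake")] * 3
    + [("entertainment", "real", "real")] * 7
    + [("hard", "real", "fake")] * 1
    + [("hard", "real", "real")] * 9
    + [("entertainment", "fake", "fake")] * 5
    + [("hard", "fake", "fake")] * 5
)
report = genre_stratified_report(records)
print(report)
```

Reporting the two numbers side by side is the point: a single accuracy figure cannot show which genre absorbs the errors.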

As LLMs take on a larger role in news consumption and credibility assessment, understanding their biases and limitations becomes more important. For developers and researchers, this research underscores the need to refine these models so that they apply consistent standards to legitimate journalism across all genres.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
