12 AI Agents Simulate Jury Decision-Making in LLM Study

12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

In a study recently posted on arXiv, researchers reimagine the classic courtroom drama “12 Angry Men” as a multi-agent deliberation among large language models (LLMs). The paper, titled “12 Angry AI Agents,” investigates how LLM agents behave when asked to simulate jury deliberation and decision-making.

The study starts from a hypothetical question: what if the twelve jurors were not men, but AI agents conditioned to emulate the characters from the film? The researchers designed a benchmark scenario in which twelve AI agents, each given a film-faithful persona, debate the murder case depicted in the 1957 classic. This setup allowed a comparison of two models at opposite ends of the Reinforcement Learning from Human Feedback (RLHF) spectrum: GPT-4o, which is closed-source and heavily aligned, and Llama-4-Scout, which is open-weight and more lightly aligned.

The study used three experimental conditions: a baseline, an open-minded prompt, and a no-initial-vote scenario. In total, 18 runs were conducted across the two models and three conditions, which works out to three runs per model per condition. Three findings stand out:
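The experimental grid can be written out explicitly. The following is a minimal sketch, not the authors' code: the condition identifiers and the `build_run_grid` helper are hypothetical names, and the value of three runs per cell is an assumption inferred from 18 total runs divided across two models and three conditions.

```python
from itertools import product

# Model and condition names follow the article; the snake_case
# condition identifiers are illustrative, not taken from the paper.
MODELS = ["GPT-4o", "Llama-4-Scout"]
CONDITIONS = ["baseline", "open_minded", "no_initial_vote"]

# Assumption: 18 runs / (2 models * 3 conditions) = 3 runs per cell.
RUNS_PER_CELL = 3


def build_run_grid():
    """Enumerate one entry per (model, condition, run index)."""
    return [
        {"model": model, "condition": condition, "run": run}
        for model, condition in product(MODELS, CONDITIONS)
        for run in range(1, RUNS_PER_CELL + 1)
    ]


grid = build_run_grid()
print(len(grid))  # 18
```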

  • Hung Jury Dominance: Seventeen of the eighteen runs ended in a hung jury, with no unanimous verdict. The models struggled to reproduce the film's central arc of gradual minority-to-majority persuasion, with anchoring identified as the main failure mode.
  • Divergent Internal Dynamics: The two models behaved very differently during deliberation. GPT-4o averaged 1.0 vote change per run in every condition, while Llama-4-Scout varied more widely, from 2.0 vote changes in the baseline condition to 6.0 under the open-minded prompt. Llama-4-Scout was also the only model to reach a unanimous NOT_GUILTY verdict, in one of its three runs under the no-initial-vote condition.
  • RLHF Training Intensity: The asymmetry observed between the two models indicates that the intensity of RLHF alignment training is a critical factor influencing deliberative flexibility in multi-agent contexts. The study posits that flexibility, rather than sheer capability, aligns more closely with human styles of deliberation.
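
The two metrics reported above, run outcome and vote changes per run, can be illustrated with a short sketch. This is one plausible way to compute them and is not taken from the paper: the function names, the `HUNG` label, and the representation of each voting round as a mapping from juror number to vote are all assumptions made for illustration. Only the `NOT_GUILTY` label appears in the source.

```python
from typing import Dict, List

GUILTY = "GUILTY"
NOT_GUILTY = "NOT_GUILTY"
HUNG = "HUNG"  # hypothetical label for a non-unanimous outcome


def classify_outcome(final_votes: Dict[int, str]) -> str:
    """Return the unanimous verdict, or HUNG if the jurors disagree."""
    if not final_votes:
        raise ValueError("final_votes must contain at least one juror")
    distinct = set(final_votes.values())
    return distinct.pop() if len(distinct) == 1 else HUNG


def count_vote_changes(rounds: List[Dict[int, str]]) -> int:
    """Count how many times any juror's vote differs from the previous round."""
    changes = 0
    for prev, curr in zip(rounds, rounds[1:]):
        changes += sum(
            1 for juror, vote in curr.items()
            if juror in prev and prev[juror] != vote
        )
    return changes


# Toy example: juror 8 dissents from the start, juror 9 flips in round 2.
round_1 = {j: GUILTY for j in range(1, 13)}
round_1[8] = NOT_GUILTY
round_2 = dict(round_1)
round_2[9] = NOT_GUILTY

print(count_vote_changes([round_1, round_2]))  # 1
print(classify_outcome(round_2))               # HUNG
```

Under this definition, a run with a single flipped vote and a split final ballot matches the pattern the article describes for GPT-4o: about one vote change and a hung jury.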

The authors frame this work as an exploratory study, paving the way for further research into the evaluation of jury-like behaviors in LLMs. By utilizing a cinematic narrative as a framework, they offer a novel perspective on how artificial agents can engage in complex decision-making processes akin to human deliberation.

This exploration of AI agents in a jury setting not only highlights the challenges faced by current LLMs in achieving consensus but also raises important questions regarding the implications of AI decision-making in real-world applications. As the field of AI continues to evolve, understanding the nuances of multi-agent interactions will be crucial for developing more sophisticated and adaptable systems.

Lazarus Omolua
https://richlyai.com/blog