12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
In a study recently published on arXiv, researchers recast the classic courtroom drama “12 Angry Men” as a multi-agent deliberation among large language models (LLMs). The paper, titled “12 Angry AI Agents,” investigates how LLM agents simulate jury deliberation and decision-making.
The study asks: what if the twelve jurors were not men, but AI agents conditioned to emulate the characters from the film? The researchers designed a benchmark scenario in which twelve AI agents, each assigned a film-faithful persona, debate the murder case depicted in the 1957 classic. This setup allowed a comparison of two models at opposite ends of the Reinforcement Learning from Human Feedback (RLHF) spectrum: GPT-4o, a closed-source model with heavy alignment, and Llama-4-Scout, an open-weight model with lighter alignment.
The study used three experimental conditions: a baseline, an open-minded prompt, and a no-initial-vote scenario. A total of 18 runs were conducted across the two models and three conditions, which is consistent with three runs per model per condition. The main findings were:
- Hung Jury Dominance: Seventeen of the eighteen runs ended in a hung jury, with no unanimous verdict. The models struggled to reproduce the film's central narrative element, gradual minority-to-majority persuasion, and anchoring was identified as the main failure mode.
- Divergent Internal Dynamics: The two models behaved very differently during deliberation. GPT-4o averaged 1.0 vote change per run in every condition, while Llama-4-Scout's vote changes ranged from 2.0 in the baseline condition to 6.0 under the open-minded prompt. Llama-4-Scout was also the only model to reach a NOT_GUILTY verdict, in one of its three runs under the no-initial-vote condition.
- RLHF Training Intensity: The authors read the asymmetry between the two models as evidence that the intensity of RLHF alignment training is a critical factor in deliberative flexibility in multi-agent contexts. The study posits that flexibility, rather than sheer capability, aligns more closely with human styles of deliberation.
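The two quantities reported above, the verdict under a unanimity rule and the number of vote changes per run, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the "GUILTY" and "HUNG" labels, the assumption of three runs per model per condition, and the toy transcript are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Model and condition names come from the article.
MODELS = ["GPT-4o", "Llama-4-Scout"]
CONDITIONS = ["baseline", "open-minded", "no-initial-vote"]
RUNS_PER_CELL = 3  # assumption: 18 runs / (2 models * 3 conditions)


def experiment_grid():
    """Enumerate every (model, condition, run index) combination."""
    return list(product(MODELS, CONDITIONS, range(RUNS_PER_CELL)))


def verdict(final_votes):
    """Unanimity rule: a verdict exists only if every juror agrees."""
    distinct = set(final_votes)
    return distinct.pop() if len(distinct) == 1 else "HUNG"


def count_vote_changes(rounds):
    """Count how often any juror's vote differs from the previous round.

    `rounds` is a list of per-round vote lists, one entry per juror.
    """
    changes = 0
    for prev, curr in zip(rounds, rounds[1:]):
        changes += sum(p != c for p, c in zip(prev, curr))
    return changes


# Toy transcript, invented for illustration (not data from the paper):
# one juror dissents at the start and persuades a single colleague.
toy_rounds = [
    ["NOT_GUILTY"] + ["GUILTY"] * 11,
    ["NOT_GUILTY", "NOT_GUILTY"] + ["GUILTY"] * 10,
    ["NOT_GUILTY", "NOT_GUILTY"] + ["GUILTY"] * 10,
]

print(len(experiment_grid()))          # 18
print(count_vote_changes(toy_rounds))  # 1
print(verdict(toy_rounds[-1]))         # HUNG
```

In this toy run a single juror changes sides and the jury still hangs, which is the pattern the article describes for most runs: a few vote changes, but no gradual shift to unanimity.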
The authors frame this work as an exploratory study that opens the way for further research on evaluating jury-like behavior in LLMs. Using a cinematic narrative as the framework gives them a novel way to examine how artificial agents handle complex, human-style deliberation.
The jury setting highlights how hard it is for current LLMs to reach consensus, and it raises questions about AI decision-making in real-world applications. As the field evolves, understanding the nuances of multi-agent interaction will be crucial for developing more sophisticated and adaptable systems.
