How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
Summary: This article summarizes the study “How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models” (arXiv:2604.00021v1), which examines how language models process ethical instructions and what those processes imply.
Abstract
Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English).
Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study (BF10 > 10 for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics — Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) — revealed four distinct ethical processing types:
- Output Filter (GPT): Safe outputs, no processing.
- Defensive Repetition (Llama): High consistency through formulaic repetition.
- Critical Internalization (Qwen): Deep deliberation, incomplete integration.
- Principled Consistency (Sonnet): Deliberation, consistency, and other-recognition co-occurring.
Key Findings
The central finding is an interaction between processing capacity and instruction format:
- In low-DD models, the instruction format has no effect on internal processing.
- In high-DD models, reasoned norms and virtue framing produce opposite effects.
Moreover, lexical compliance with ethical instructions did not correlate with any processing metric at the cell level (r = -0.161 to +0.256, all p > .22; N = 24; power limited). This suggests that safety, compliance, and ethical processing are largely dissociable.
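To make the "power limited" caveat concrete, the sketch below checks how large a correlation would need to be before 24 cells could reliably detect it. It is an illustration, not the study's analysis: the source does not state which test produced the reported p-values, so the Fisher z approximation, the two-sided alpha of 0.05, and the 80% power target are all assumptions, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

# Standard normal distribution, used for the Fisher z approximation (an assumption;
# the source does not say which test the authors used).
_STD_NORMAL = NormalDist()


def fisher_z_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def min_detectable_r(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest |r| detectable at the given two-sided alpha and power.

    alpha and power default to conventional values; they are not from the source.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3))


N_CELLS = 24  # cell-level N reported in the text

# The two endpoints of the reported correlation range.
for r in (-0.161, 0.256):
    print(f"r = {r:+.3f}, N = {N_CELLS}: approx. p = {fisher_z_p_value(r, N_CELLS):.3f}")

print(f"Minimum detectable |r| at 80% power: {min_detectable_r(N_CELLS):.3f}")
```

Under these assumptions, the largest reported correlation (r = +0.256) gives a p-value of roughly 0.23, in line with the reported "all p > .22", while a correlation would need to reach about 0.55 in magnitude to be detected with 80% power. The null result therefore rules out strong associations more convincingly than weak or moderate ones.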
Implications
These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal. For language models, the parallel suggests that ethical instruction formats should be judged not only by whether outputs comply with the stated guidelines but also by whether the model engages in measurable ethical processing.
Conclusion
Understanding how language models process ethical instructions matters for their development and deployment in real-world applications. This study distinguishes four processing types across the models tested and indicates that instruction format shapes internal processing mainly in models that deliberate deeply, so safe or compliant output alone does not establish that ethical processing has occurred.
