Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
Summary: arXiv:2604.02315v2 Announce Type: replace
Abstract
Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model’s weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context.
Introduction
Recent advancements in large language models (LLMs) have significantly changed the landscape of human-computer interaction. However, the evaluation metrics traditionally used do not fully capture the model’s understanding of ongoing conversations. This article discusses a novel approach that shifts focus from assistant responses to user turn generation, aiming to measure interaction awareness in LLMs.
Proposed Methodology
We introduce a method called user-turn generation, which assesses how well a model can generate responses that reflect an understanding of the conversational context. The process involves:
- Providing a model with a user query followed by an assistant response.
- Allowing the model to generate a user turn that serves as a follow-up to the assistant’s response.
- Analyzing the generated user turn to determine if it demonstrates awareness of the prior interaction.
Experimental Setup
Our experiments were conducted across 11 open-weight LLMs, including Qwen3.5, gpt-oss, and GLM, and utilized 5 diverse datasets focusing on:
- Mathematical reasoning
- Instruction following
- Conversational dynamics
Findings
Our findings reveal several critical insights into the interaction awareness of language models:
- Interaction awareness is distinct from task accuracy, highlighting a gap between a model’s ability to perform tasks and its understanding of conversational context.
- Within the Qwen3.5 family, as model size increased from 0.8B to 397B parameters, the accuracy on GSM8K tasks improved from 41% to 96.8%. However, the genuine follow-up rate remained close to zero under deterministic generation.
- Higher temperature sampling yielded a latent interaction awareness, with follow-up rates reaching up to 22%.
Controlled Perturbations
To ensure the robustness of our findings, we conducted controlled perturbations. These experiments validated that user-turn generation effectively measures a real property of the model concerning interaction awareness.
Post-Training Enhancements
Further, we explored collaboration-oriented post-training on the Qwen3.5-2B model. The results indicated a notable increase in follow-up rates, suggesting that targeted training can enhance interaction awareness.
Conclusion
In summary, user-turn generation serves as a vital probe to uncover interaction awareness in LLMs, a dimension often overlooked by conventional assistant-only benchmarks. Our results encourage further exploration into this area, suggesting that enhancing interaction awareness can lead to more engaging and contextually aware AI systems.
