Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
arXiv:2604.03127v1 (cross-listed)
Abstract
Automated annotation of pedagogical dialogue is a high-stakes task on which large language models (LLMs) often struggle without sufficient domain grounding. We present a domain-adapted Retrieval-Augmented Generation (RAG) pipeline designed for tutoring move annotation. Rather than fine-tuning the generative model, we adapt the retrieval step: we fine-tune a lightweight embedding model on tutoring corpora and index dialogues at the utterance level, so that the retriever returns labeled utterances as few-shot demonstrations.
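The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for the fine-tuned embedding model, and the index entries, label names, and prompt wording are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for the fine-tuned embedder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Utterance-level index: one entry per labeled utterance, not per dialogue.
# The utterances and labels below are hypothetical examples.
INDEX = [
    ("Can you explain why you think that?", "press_for_reasoning"),
    ("So you are saying the answer is twelve?", "revoicing"),
    ("Please open your books to page ten.", "none"),
]

def retrieve(query, k=2):
    """Return the k labeled utterances most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, embed(item[0])), reverse=True)
    return ranked[:k]

def build_prompt(query, k=2):
    """Assemble a few-shot prompt from the retrieved labeled demonstrations."""
    parts = ["Label the tutor utterance with a dialogue act."]
    for text, label in retrieve(query, k):
        parts.append(f"Utterance: {text}\nLabel: {label}")
    parts.append(f"Utterance: {query}\nLabel:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(build_prompt("Can you explain why you chose that?"))
```

The resulting prompt would be sent to an unmodified LLM backbone; only the embedder and the index are adapted to the domain.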
Evaluation
We evaluate the approach on two real tutoring dialogue datasets, TalkMoves and Eedi, using three LLM backbones: GPT-5.2, Claude Sonnet 4.6, and Qwen3-32b. Our best configuration achieves Cohen’s kappa of 0.526 to 0.580 on TalkMoves and 0.659 to 0.743 on Eedi, substantially outperforming the no-retrieval baselines, which yield 0.275 to 0.413 on TalkMoves and 0.160 to 0.410 on Eedi.
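For readers unfamiliar with the metric, Cohen's kappa corrects raw agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected from each annotator's label frequencies. The sketch below is a plain-Python illustration of that standard formula; the function name and the toy label lists are assumptions and do not come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both sequences use one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

if __name__ == "__main__":
    human = ["a", "a", "b", "b"]   # hypothetical expert labels
    model = ["a", "b", "b", "b"]   # hypothetical model labels
    print(cohens_kappa(human, model))  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```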
Results
An ablation study shows that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains. Under domain-adapted retrieval, top-1 label match rates rise from 39.7% to 62.0% on TalkMoves and from 52.9% to 73.1% on Eedi.
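One plausible reading of the top-1 label match rate is the share of query utterances whose single nearest retrieved demonstration carries the same gold label as the query. The sketch below computes it under that assumption; the function name and the toy labels are hypothetical, and the paper may define the metric differently.

```python
def top1_label_match_rate(query_labels, top1_retrieved_labels):
    """Fraction of queries whose top-1 retrieved example shares their gold label."""
    if len(query_labels) != len(top1_retrieved_labels) or not query_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(q == r for q, r in zip(query_labels, top1_retrieved_labels))
    return matches / len(query_labels)

if __name__ == "__main__":
    gold = ["revoicing", "none", "press", "revoicing"]   # hypothetical gold labels
    top1 = ["revoicing", "none", "revoicing", "none"]    # labels of nearest neighbors
    print(top1_label_match_rate(gold, top1))  # 2 of 4 match, so 0.5
```

A higher rate means the first demonstration shown to the LLM is more often an example of the correct label, which is consistent with the downstream kappa gains reported above.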
Implications
Retrieval not only improves annotation accuracy but also corrects systematic label biases present in zero-shot prompting, with the largest gains on rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective strategy for expert-level annotation of pedagogical dialogue, while leaving the generative model unmodified.
Conclusion
Concentrating on the retrieval component of the RAG architecture can substantially improve the performance of pedagogical dialogue annotation systems. This improves the reliability of automated annotation in educational settings and opens avenues for further research on dialogue systems, particularly in educational technology.
Future Work
Future work will integrate additional datasets and refine the indexing techniques to make the retrieval process more adaptable. Applying the approach to domains beyond pedagogy could also yield insights for dialogue annotation more broadly.
