Evaluation of Large Language Models via Coupled Token Generation
Summary: arXiv:2502.01754v3
Abstract: State-of-the-art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples.
However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite number of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process.
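To make "sampling with the same source of randomness" concrete, here is a minimal sketch in Python. It assumes the shared randomness is a vector of Gumbel noise fed to the Gumbel-max trick, which is one standard way to sample from a softmax distribution; the abstract itself does not spell out the construction. The function names and the toy logits are hypothetical, and both models are assumed to share one vocabulary.

```python
import numpy as np

def gumbel_max_sample(logits, gumbel_noise):
    """Return the index maximising logit + noise (Gumbel-max trick).

    With i.i.d. standard Gumbel noise this is a draw from softmax(logits).
    """
    return int(np.argmax(logits + gumbel_noise))

def coupled_step(logits_a, logits_b, rng):
    """Sample one token per model, reusing the SAME noise vector for both."""
    noise = rng.gumbel(size=logits_a.shape)
    return gumbel_max_sample(logits_a, noise), gumbel_max_sample(logits_b, noise)

def vanilla_step(logits_a, logits_b, rng):
    """Sample one token per model with independent noise vectors."""
    noise_a = rng.gumbel(size=logits_a.shape)
    noise_b = rng.gumbel(size=logits_b.shape)
    return gumbel_max_sample(logits_a, noise_a), gumbel_max_sample(logits_b, noise_b)

def agreement_rate(step_fn, logits_a, logits_b, n_draws=5000, seed=0):
    """Fraction of draws in which both models emit the same token."""
    rng = np.random.default_rng(seed)
    same = 0
    for _ in range(n_draws):
        token_a, token_b = step_fn(logits_a, logits_b, rng)
        same += token_a == token_b
    return same / n_draws

# Made-up next-token logits for two similar models over a 5-token vocabulary.
LOGITS_A = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
LOGITS_B = np.array([1.8, 1.2, 0.4, 0.1, -0.8])

if __name__ == "__main__":
    print("coupled agreement:", agreement_rate(coupled_step, LOGITS_A, LOGITS_B))
    print("vanilla agreement:", agreement_rate(vanilla_step, LOGITS_A, LOGITS_B))
```

In this toy setting each model still samples from its own next-token distribution in both modes; only the dependence between the two models' outputs changes. Under shared noise the two similar models pick the same token far more often, so the remaining disagreements come from the models rather than from the draw.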
Key Findings
- Coupled autoregressive generation allows for sampling responses with a shared source of randomness across different models.
- On benchmark datasets, coupled autoregressive generation can achieve the same evaluation results as vanilla autoregressive generation, while requiring up to 75% fewer samples.
- In pairwise comparisons among more than two models, coupled and vanilla autoregressive generation can produce different rankings, even with infinitely many samples, so a model's apparent advantage may be an artifact of sampling randomness.
- Experiments conducted with models from the Llama, Mistral, and Qwen families support the findings of the study.
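The sample-efficiency finding can be illustrated with a small simulation. The sketch below is a toy, not the paper's experiment: the benchmark, the logits, and every name in it are invented for illustration, and the coupling is assumed to be shared Gumbel noise. It compares the per-question score gap between two models under independent and shared noise. The expected gap is the same in both cases, but shared noise makes the two models' scores positively correlated, which lowers the variance of the gap and therefore the number of samples needed to estimate it to a given precision.

```python
import numpy as np

def score_gaps(logits_a, logits_b, correct, coupled, n_trials=4000, seed=0):
    """Return an (n_trials, n_questions) array of score differences A - B.

    Each score is 1.0 if the sampled answer is correct and 0.0 otherwise.
    """
    rng = np.random.default_rng(seed)
    shape = (n_trials,) + logits_a.shape
    noise_a = rng.gumbel(size=shape)
    # Coupled: model B reuses model A's noise. Vanilla: B gets fresh noise.
    noise_b = noise_a if coupled else rng.gumbel(size=shape)
    answers_a = np.argmax(logits_a + noise_a, axis=-1)
    answers_b = np.argmax(logits_b + noise_b, axis=-1)
    return (answers_a == correct).astype(float) - (answers_b == correct).astype(float)

# Hypothetical multiple-choice benchmark: 50 questions, 4 answer options each.
_setup = np.random.default_rng(123)
N_QUESTIONS, N_CHOICES = 50, 4
CORRECT = _setup.integers(0, N_CHOICES, size=N_QUESTIONS)
BASE = _setup.normal(size=(N_QUESTIONS, N_CHOICES))
BASE[np.arange(N_QUESTIONS), CORRECT] += 1.0  # both models lean toward the right answer
LOGITS_A = BASE + 0.3 * _setup.normal(size=BASE.shape)
LOGITS_B = BASE + 0.3 * _setup.normal(size=BASE.shape)

GAPS_VANILLA = score_gaps(LOGITS_A, LOGITS_B, CORRECT, coupled=False)
GAPS_COUPLED = score_gaps(LOGITS_A, LOGITS_B, CORRECT, coupled=True)

if __name__ == "__main__":
    for name, gaps in [("vanilla", GAPS_VANILLA), ("coupled", GAPS_COUPLED)]:
        # Mean gap estimates the accuracy difference; variance is averaged over questions.
        print(name, "mean gap:", gaps.mean(), "gap variance:", gaps.var(axis=0).mean())
```

How large the variance reduction is depends on how similar the two models' answer distributions are, so the ratio printed here should not be read as an estimate of the savings reported in the study.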
Implications for Future Research
These findings matter for how models are evaluated in natural language processing. Because sampling randomness can change both the number of samples an evaluation needs and, in pairwise comparisons, the resulting ranking, evaluation protocols should control for it explicitly. The work also motivates further study of evaluation protocols whose conclusions reflect differences between models rather than differences between random draws.
Conclusion
As large language models spread into more applications, evaluating them reliably becomes increasingly important. This study argues that current evaluation techniques should be reconsidered and suggests that coupled autoregressive generation may offer a more reliable framework for assessing model performance. Controlling for randomness gives researchers a clearer view of what separates one model from another, which in turn supports responsible and effective deployment.
