League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
arXiv:2507.22359v4
Abstract: Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation.
Introduction
Large language models (LLMs) are now used across many sectors, from customer service to content creation, but evaluating them reliably remains difficult. Existing evaluation methods suffer from data contamination, opaque operation, and subjective preferences, which leads to inconsistent results. To address this, the authors propose the League of LLMs (LOL).
Core Features of the League of LLMs
LOL is built upon four essential criteria aimed at enhancing the evaluation process:
- Dynamic: The league format allows for continuous updates and adjustments based on model performance.
- Transparent: All evaluation processes are made clear and accessible, reducing uncertainties in results.
- Objective: By utilizing a self-governing system, LOL minimizes subjective biases that can skew results.
- Professional: The evaluation incorporates expert insights to refine the assessment of LLM capabilities.
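To make the idea of mutual evaluation concrete, here is a minimal sketch of how peer scores might be aggregated into a league ranking. It is an illustration only, not the authors' implementation: the function name `league_ranking`, the score layout, the model names, and the rule of dropping self-scores are all assumptions made for this example.

```python
from statistics import mean


def league_ranking(scores):
    """Rank models by the mean score they receive from the other models.

    `scores[judge][candidate]` is the score `judge` gave to `candidate`.
    Self-scores are ignored (an assumption of this sketch) so that no
    model contributes to its own standing.
    """
    candidates = {c for row in scores.values() for c in row}
    received = {}
    for candidate in candidates:
        peer_scores = [
            row[candidate]
            for judge, row in scores.items()
            if judge != candidate and candidate in row
        ]
        if peer_scores:
            received[candidate] = mean(peer_scores)
    # Highest mean first; break ties by name so the order is deterministic.
    return sorted(received.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for three placeholder models.
scores = {
    "model_a": {"model_a": 10, "model_b": 6, "model_c": 4},
    "model_b": {"model_a": 8, "model_b": 10, "model_c": 5},
    "model_c": {"model_a": 7, "model_b": 7, "model_c": 10},
}
ranking = league_ranking(scores)
print(ranking)
```

Excluding self-scores is one simple way to keep a self-governed league from rewarding models that rate themselves highly; the actual system may handle this differently.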
Experimental Findings
In experiments with eight mainstream LLMs on mathematics and programming tasks, LOL distinguished model capabilities effectively. Its internal rankings were also stable, reaching a Top-k consistency of 70.7%. Ranking stability matters because an evaluation is only useful if repeated runs give similar results.
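This summary does not define Top-k consistency precisely, so the following is a sketch of one plausible reading: the average overlap between the top-k sets of every pair of rankings. The function names, the pairwise-overlap definition, and the example rankings are assumptions for illustration, not the paper's metric.

```python
from itertools import combinations


def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k entries shared by two rankings (order ignored)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def mean_top_k_consistency(rankings, k):
    """Average top-k overlap over all pairs of rankings."""
    pairs = list(combinations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two rankings")
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Three hypothetical rankings of the same four models, best first.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
consistency = mean_top_k_consistency(rankings, k=2)
print(f"{consistency:.3f}")
```

Under this definition a value of 1.0 means every ranking agrees on which models make the top k, regardless of their order within it.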
Insights Beyond Traditional Paradigms
Beyond ranking models, LOL surfaces empirical findings that traditional paradigms tend to miss. For instance, the authors observed “memorization-based answering” in certain models, which suggests reliance on previously encountered data rather than genuine problem solving. The evaluation also found higher in-family scores within the OpenAI model family, a difference of 9 points (p < 0.05). Because these scores come from mutual evaluation, such a gap points to models favoring members of their own family, a bias that any peer-evaluation scheme has to account for.
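An in-family score gap of this kind can be sketched as the mean score judges give to models in their own family minus the mean score they give to models outside it. The code below is a hedged illustration: `in_family_gap`, the family mapping, and all scores are hypothetical, and it computes only the raw gap, not the significance test behind the reported p-value.

```python
from statistics import mean


def in_family_gap(scores, family):
    """Mean in-family score minus mean out-of-family score.

    `scores[judge][candidate]` is the score `judge` gave to `candidate`;
    `family[model]` names the family each model belongs to.
    Self-scores are skipped.
    """
    in_scores, out_scores = [], []
    for judge, row in scores.items():
        for candidate, score in row.items():
            if judge == candidate:
                continue
            if family[judge] == family[candidate]:
                in_scores.append(score)
            else:
                out_scores.append(score)
    if not in_scores or not out_scores:
        raise ValueError("need both in-family and out-of-family scores")
    return mean(in_scores) - mean(out_scores)


# Hypothetical families and scores: two models in family X, one in family Y.
family = {"x1": "X", "x2": "X", "y1": "Y"}
scores = {
    "x1": {"x2": 90, "y1": 70},
    "x2": {"x1": 88, "y1": 72},
    "y1": {"x1": 80, "x2": 78},
}
gap = in_family_gap(scores, family)
print(gap)
```

A positive gap alone does not establish bias, since family members may simply be stronger; a statistical test against out-of-family judges' scores would be needed to support a claim like the one reported.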
Availability and Future Directions
The authors have made the LOL system and its code publicly available, with the aim of enriching the current LLM evaluation ecosystem and encouraging further research on evaluating large language models.
Conclusion
The League of LLMs is a step toward more reliable and objective evaluation of large language models. By organizing models into a benchmark-free, self-governed league for mutual evaluation, LOL addresses several limitations of traditional methods. As LLMs continue to evolve, frameworks of this kind can help ensure that new models are assessed effectively and fairly.
