In Harmony with gpt-oss
Summary: arXiv:2604.00362v1 Announce Type: new
Abstract: No one has independently reproduced OpenAI’s published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model’s in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence — a strong prior, not a hallucination. We then built a native harmony agent harness that encodes messages in the model’s native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI’s published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
Introduction
The development of AI language models has progressed rapidly, with various organizations attempting to push the boundaries of what these models can achieve. Among these is OpenAI’s gpt-oss-20b, which has garnered attention for its impressive performance metrics. However, the lack of independent verification of its scores has raised questions within the AI community.
Challenges in Reproduction
One of the primary challenges in reproducing the results of gpt-oss-20b lies in the absence of detailed information regarding the tools and agent harness utilized in the original evaluation. This gap has hindered researchers from accurately assessing the model’s capabilities.
Reverse Engineering the Model
In response to this challenge, our team took the initiative to reverse-engineer the in-distribution tools of the gpt-oss model. The findings revealed that even without explicit tool definitions, gpt-oss demonstrates a robust ability to call tools from its training distribution with significant statistical confidence. This indicates that the model possesses a strong prior knowledge base rather than simply generating outputs based on hallucination.
Development of the Harmony Agent Harness
To facilitate a more accurate evaluation of gpt-oss, we created a native harmony agent harness. This tool encodes messages in the model’s native format, effectively bypassing the limitations associated with the lossy Chat Completions conversion process. The implementation of this agent harness represents a significant advancement in achieving precise communication with the model.
Results of Independent Reproduction
With the combination of reverse-engineered tools and the new harmony agent harness, we successfully achieved the first independent reproduction of OpenAI’s published scores for gpt-oss-20b. The results are as follows:
- SWE Verified HIGH: 60.4% (originally reported as 60.7%)
- SWE Verified MEDIUM: 53.3% (originally reported as 53.2%)
- AIME25 with tools: 91.7% (originally reported as 90.4%)
Conclusion
Our findings underscore the importance of transparency in AI research and the need for robust methods to evaluate the performance of advanced models. The successful reproduction of gpt-oss-20b’s scores not only validates OpenAI’s original claims but also opens the door for further exploration and understanding of the capabilities of AI language models.
Future Directions
As the field of artificial intelligence continues to evolve, it is crucial for researchers to share methodologies and insights to ensure the integrity of model evaluations. Our work with gpt-oss-20b serves as a foundation for future studies aimed at enhancing AI performance and reliability.
