Independent Reproduction of OpenAI GPT-OSS-20B Scores

In Harmony with gpt-oss

Summary: arXiv:2604.00362v1 Announce Type: new

Abstract: No one has independently reproduced OpenAI’s published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model’s in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence — a strong prior, not a hallucination. We then built a native harmony agent harness that encodes messages in the model’s native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI’s published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).

Introduction

The development of AI language models has progressed rapidly, with various organizations attempting to push the boundaries of what these models can achieve. Among these is OpenAI’s gpt-oss-20b, which has garnered attention for its impressive performance metrics. However, the lack of independent verification of its scores has raised questions within the AI community.

Challenges in Reproduction

One of the primary challenges in reproducing the results of gpt-oss-20b lies in the absence of detailed information regarding the tools and agent harness utilized in the original evaluation. This gap has hindered researchers from accurately assessing the model’s capabilities.

Reverse Engineering the Model

In response to this challenge, our team took the initiative to reverse-engineer the in-distribution tools of the gpt-oss model. The findings revealed that even without explicit tool definitions, gpt-oss demonstrates a robust ability to call tools from its training distribution with significant statistical confidence. This indicates that the model possesses a strong prior knowledge base rather than simply generating outputs based on hallucination.

Development of the Harmony Agent Harness

To facilitate a more accurate evaluation of gpt-oss, we created a native harmony agent harness. This tool encodes messages in the model’s native format, effectively bypassing the limitations associated with the lossy Chat Completions conversion process. The implementation of this agent harness represents a significant advancement in achieving precise communication with the model.

Results of Independent Reproduction

With the combination of reverse-engineered tools and the new harmony agent harness, we successfully achieved the first independent reproduction of OpenAI’s published scores for gpt-oss-20b. The results are as follows:

SWE Verified HIGH: 60.4% (originally reported as 60.7%)
SWE Verified MEDIUM: 53.3% (originally reported as 53.2%)
AIME25 with tools: 91.7% (originally reported as 90.4%)

Conclusion

Our findings underscore the importance of transparency in AI research and the need for robust methods to evaluate the performance of advanced models. The successful reproduction of gpt-oss-20b’s scores not only validates OpenAI’s original claims but also opens the door for further exploration and understanding of the capabilities of AI language models.

Future Directions

As the field of artificial intelligence continues to evolve, it is crucial for researchers to share methodologies and insights to ensure the integrity of model evaluations. Our work with gpt-oss-20b serves as a foundation for future studies aimed at enhancing AI performance and reliability.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!

Company

Independent Reproduction of OpenAI GPT-OSS-20B Scores

In Harmony with gpt-oss

Introduction

Challenges in Reproduction

Reverse Engineering the Model

Development of the Harmony Agent Harness

Results of Independent Reproduction

Conclusion

Future Directions

Related AI Insights

Subscribe

More like thisRelated

About us

Company

The latest

Subscribe

RichlyAI Blog
AI Guide, Tutorials, Industrial Insights, & more!

More like this
Related