Independent Reproduction of OpenAI GPT-OSS-20B Scores

Date:

In Harmony with gpt-oss

Summary: arXiv:2604.00362v1 Announce Type: new

Abstract: No one has independently reproduced OpenAI’s published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model’s in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence — a strong prior, not a hallucination. We then built a native harmony agent harness that encodes messages in the model’s native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI’s published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).

Introduction

The development of AI language models has progressed rapidly, with various organizations attempting to push the boundaries of what these models can achieve. Among these is OpenAI’s gpt-oss-20b, which has garnered attention for its impressive performance metrics. However, the lack of independent verification of its scores has raised questions within the AI community.

Challenges in Reproduction

One of the primary challenges in reproducing the results of gpt-oss-20b lies in the absence of detailed information regarding the tools and agent harness utilized in the original evaluation. This gap has hindered researchers from accurately assessing the model’s capabilities.

Reverse Engineering the Model

In response to this challenge, our team took the initiative to reverse-engineer the in-distribution tools of the gpt-oss model. The findings revealed that even without explicit tool definitions, gpt-oss demonstrates a robust ability to call tools from its training distribution with significant statistical confidence. This indicates that the model possesses a strong prior knowledge base rather than simply generating outputs based on hallucination.

Development of the Harmony Agent Harness

To facilitate a more accurate evaluation of gpt-oss, we created a native harmony agent harness. This tool encodes messages in the model’s native format, effectively bypassing the limitations associated with the lossy Chat Completions conversion process. The implementation of this agent harness represents a significant advancement in achieving precise communication with the model.

Results of Independent Reproduction

With the combination of reverse-engineered tools and the new harmony agent harness, we successfully achieved the first independent reproduction of OpenAI’s published scores for gpt-oss-20b. The results are as follows:

  • SWE Verified HIGH: 60.4% (originally reported as 60.7%)
  • SWE Verified MEDIUM: 53.3% (originally reported as 53.2%)
  • AIME25 with tools: 91.7% (originally reported as 90.4%)

Conclusion

Our findings underscore the importance of transparency in AI research and the need for robust methods to evaluate the performance of advanced models. The successful reproduction of gpt-oss-20b’s scores not only validates OpenAI’s original claims but also opens the door for further exploration and understanding of the capabilities of AI language models.

Future Directions

As the field of artificial intelligence continues to evolve, it is crucial for researchers to share methodologies and insights to ensure the integrity of model evaluations. Our work with gpt-oss-20b serves as a foundation for future studies aimed at enhancing AI performance and reliability.


Related AI Insights

Lazarus Omolua
Lazarus Omoluahttps://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.

Subscribe

Popular

More like this
Related

How Business Ops Teams Boost Productivity with Codex

Discover how business operations teams use Codex to streamline documentation, enhance collaboration, and improve decision-making with AI-powered automation...

OpenAI Partners with Malta to Offer ChatGPT Plus Nationwide

OpenAI and Malta team up to provide free ChatGPT Plus access and AI training to all citizens, promoting digital literacy and responsible AI use.

Critical Linux Kernel Flaw Risks SSH Host Key Theft

A critical Linux kernel flaw risks stolen SSH host keys. Learn how to protect your systems and stay secure until patches are widely available.

Top External Hard Drives 2026: Expert Reviews & Buying Guide

Discover the best external hard drives of 2026 with expert reviews. Find top picks for speed, durability, and security to suit all storage needs.