COMPOSITE-STEM: A New Benchmark for AI in Scientific Discovery
The field of artificial intelligence (AI) is rapidly evolving, particularly in its application to scientific discovery. However, the integration of AI agents into real-world workflows has been hindered by a lack of comprehensive evaluations that can accurately measure their capabilities. In response to this challenge, a new benchmark called COMPOSITE-STEM has been introduced, aiming to bridge the evaluation gap and enhance the adoption of AI in scientific domains.
Overview of COMPOSITE-STEM
The benchmark, detailed in the preprint arXiv:2604.09836v1, comprises 70 expert-written tasks across four scientific disciplines: physics, biology, chemistry, and mathematics. The tasks were curated by doctoral-level researchers so that they reflect the complexities and nuances of real scientific problems. The goal of COMPOSITE-STEM is to provide a more robust framework for assessing AI reasoning in a way that aligns with actual scientific inquiry.
Innovative Grading Protocol
One of the standout features of COMPOSITE-STEM is its hybrid grading approach, which combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol. This allows a more flexible assessment of AI-generated outputs, one that weighs scientific merit rather than only whether a final answer is correct. With this multifaceted grading system, COMPOSITE-STEM aims to better capture the ability of AI systems to generate scientifically meaningful results.
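As a rough illustration of how such a hybrid scheme could fit together, the Python sketch below grades a task by exact match when a reference answer exists and otherwise scores it against rubric criteria, with each criterion decided by a majority vote of jurors. This is not the benchmark's actual implementation: the names Task, exact_match, jury_rubric_score, and grade are hypothetical, the majority-vote rule and the fraction-of-criteria score are assumptions, and the jurors are plain functions standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical juror type: takes (answer, criterion) and returns True if the
# juror judges the criterion satisfied. In practice this would wrap an LLM call.
Juror = Callable[[str, str], bool]


@dataclass
class Task:
    """Hypothetical task record: either a reference answer or rubric criteria."""
    expected: Optional[str] = None
    criteria: Sequence[str] = ()


def exact_match(answer: str, expected: str) -> float:
    """Return 1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def jury_rubric_score(answer: str, criteria: Sequence[str],
                      jurors: Sequence[Juror]) -> float:
    """Fraction of criteria accepted by a strict majority of jurors (assumed rule)."""
    if not criteria or not jurors:
        raise ValueError("need at least one criterion and one juror")
    passed = 0
    for criterion in criteria:
        votes = sum(1 for juror in jurors if juror(answer, criterion))
        if votes * 2 > len(jurors):  # strict majority
            passed += 1
    return passed / len(criteria)


def grade(task: Task, answer: str, jurors: Sequence[Juror]) -> float:
    """Use exact match when a reference answer exists, else the jury rubric."""
    if task.expected is not None:
        return exact_match(answer, task.expected)
    return jury_rubric_score(answer, task.criteria, jurors)
```

A strict majority is used here so that a single lenient or harsh juror cannot decide a criterion alone; other aggregation rules, such as averaging juror scores, would be equally plausible.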
Evaluation and Results
To test the benchmark, the researchers used an adapted multimodal Terminus-2 agent within the Harbor agentic evaluation framework to evaluate four frontier AI models. The top-performing model scored only 21%. This outcome suggests that COMPOSITE-STEM captures challenges and capabilities that are currently beyond the reach of existing AI agents, and that substantial room remains for progress in AI-assisted scientific discovery.
Open-Sourced and Collaborative
In an effort to promote reproducibility and encourage further research, all tasks within the COMPOSITE-STEM benchmark have been open-sourced with the permission of contributors. This transparency is crucial for fostering collaboration within the scientific community and driving the next wave of innovation in AI technology. By making these resources available, the creators of COMPOSITE-STEM hope to inspire additional research aimed at accelerating scientific progress in the fields of physics, biology, chemistry, and mathematics.
Conclusion
The introduction of COMPOSITE-STEM marks a significant step forward in the evaluation of AI agents in scientific contexts. By addressing the limitations of existing benchmarks and providing a more comprehensive assessment framework, COMPOSITE-STEM has the potential to enhance the integration of AI into scientific workflows, ultimately accelerating discoveries that can benefit humanity. As AI continues to evolve, benchmarks like COMPOSITE-STEM will be essential in guiding its development and application in real-world scenarios.
