AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction
Artificial intelligence (AI) and machine learning are increasingly being integrated into Architecture, Engineering, and Construction (AEC). AEC-Bench is a new multimodal benchmark for evaluating agentic systems on real-world tasks in this domain. Its goal is to improve both our understanding of how AI systems handle complex AEC tasks and the capabilities of those systems.
Overview of AEC-Bench
Released as arXiv:2603.29199v1, AEC-Bench is designed to assess AI systems across a range of AEC tasks. It focuses on three main areas:
- Drawing Understanding: Evaluating the ability of AI systems to interpret architectural drawings and blueprints.
- Cross-Sheet Reasoning: Assessing how well AI systems connect information across the different sheets and documents of a construction project.
- Construction Project-Level Coordination: Measuring a system’s ability to manage and coordinate tasks effectively at the project level.
Motivation and Objectives
AEC-Bench is motivated by the need for a standardized way to evaluate agentic systems operating in the AEC domain. The benchmark seeks to:
- Provide a comprehensive dataset that reflects real-world challenges faced in architecture, engineering, and construction.
- Enable researchers to identify tools and design techniques that consistently improve the performance of AI models.
- Facilitate the development of new methodologies and models that can handle the intricacies of AEC tasks.
Dataset Taxonomy and Evaluation Protocol
AEC-Bench organizes its tasks into a taxonomy based on their complexity and requirements, covering a wide range of scenarios encountered in real AEC projects. The evaluation protocol is designed for objectivity and reproducibility, so researchers can assess and compare agentic systems in a standardized way.
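To make the idea of a categorized, reproducible evaluation concrete, the following is a minimal sketch of how per-category pass rates could be aggregated. It is not the AEC-Bench evaluation code: the `TaskResult` record, its field names, and the `pass_rate_by_category` helper are hypothetical, and the categories simply mirror the three areas listed in the overview.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class TaskCategory(Enum):
    # Mirrors the three focus areas described in the overview.
    DRAWING_UNDERSTANDING = "drawing_understanding"
    CROSS_SHEET_REASONING = "cross_sheet_reasoning"
    PROJECT_COORDINATION = "project_level_coordination"


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record; these fields are assumptions, not the AEC-Bench schema.
    task_id: str
    category: TaskCategory
    passed: bool


def pass_rate_by_category(results):
    """Return the fraction of passed tasks for each category that has results."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result.category] += 1
        passes[result.category] += int(result.passed)
    return {category: passes[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Toy values for illustration only; these are not benchmark results.
    demo = [
        TaskResult("t1", TaskCategory.DRAWING_UNDERSTANDING, True),
        TaskResult("t2", TaskCategory.DRAWING_UNDERSTANDING, False),
        TaskResult("t3", TaskCategory.CROSS_SHEET_REASONING, True),
    ]
    for category, rate in pass_rate_by_category(demo).items():
        print(f"{category.value}: {rate:.2f}")
```

Reporting a score per category rather than a single aggregate keeps a strong result in one area from masking a weak result in another, which is one plausible reason to organize tasks into a taxonomy in the first place.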
Baseline Results and Future Directions
Baseline evaluations cover several foundation models run through agent harnesses, including Claude Code and Codex. The results point to significant room for improving AI performance on AEC tasks through consistent tools and design techniques.
The AEC-Bench team has openly released the benchmark dataset, agent harness, and evaluation code at https://github.com/nomic-ai/aec-bench under an Apache 2.0 license, so that researchers and practitioners can replicate and build on this work.
Conclusion
AEC-Bench is a valuable resource at the intersection of AI and the AEC industry. As agentic systems become more integrated into construction processes, benchmarks like AEC-Bench will play an important role in measuring and improving their capabilities, so that they can meet the complex demands of real-world applications.
