CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
Large language models (LLMs) are increasingly used as co-pilots by developers and software architects. Despite this growing role, there is still a gap in quantifying and evaluating how well LLMs understand cloud-native software architecture. To address it, researchers have introduced a benchmark called CAKE (Cloud Architecture Knowledge Evaluation).
The CAKE benchmark comprises 188 expert-validated questions that span four cognitive levels of Bloom’s revised taxonomy: recall, analyze, design, and implement. These questions are organized around five core topics related to cloud-native architecture, allowing for a comprehensive assessment of an LLM’s capabilities.
Methodology
The evaluation covers 22 model configurations, ranging from 0.5 billion to 70 billion parameters and drawn from four LLM families. Two formats are used: multiple-choice questions (MCQs), scored by majority vote over three runs, and free-response (FR) questions, scored with an LLM-as-a-judge approach. Using both formats gives a fuller picture of each model’s architectural knowledge than either format would alone.
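The three-run majority vote for MCQs can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function names `majority_vote` and `mcq_accuracy` are hypothetical, and the handling of a three-way split (counted as incorrect here) is an assumption, since the text above does not say how such cases are resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option chosen by a strict majority of runs, else None."""
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    # Assumption: without a strict majority (e.g. A, B, C) there is no answer.
    return label if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question: Sequence[Sequence[str]],
                 gold: Sequence[str]) -> float:
    """Fraction of questions whose majority answer matches the gold label."""
    if not gold or len(runs_per_question) != len(gold):
        raise ValueError("need one non-empty list of runs per gold label")
    correct = sum(
        majority_vote(runs) == key
        for runs, key in zip(runs_per_question, gold)
    )
    return correct / len(gold)


# Illustrative data only: four questions, three runs each.
example_runs = [
    ["A", "A", "B"],  # majority A
    ["C", "D", "C"],  # majority C
    ["A", "B", "C"],  # no majority, counted as incorrect
    ["B", "B", "B"],  # majority B
]
example_gold = ["A", "D", "A", "B"]

print(mcq_accuracy(example_runs, example_gold))  # prints 0.5
```

Voting over repeated runs smooths out sampling noise in a single generation, which matters most for smaller models whose answers vary more from run to run.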
Key Findings
The evaluation yielded four main findings about the capabilities and limitations of LLMs in cloud-native architecture:
- MCQ Plateau: MCQ accuracy plateaus at around 3 billion parameters, with the best-performing model reaching 99.2%.
- Steady Scaling of Free-Response Scores: In contrast to MCQ accuracy, free-response scores rise steadily with model size across all cognitive levels, suggesting that larger models articulate complex architectural ideas more effectively.
- Differentiation Between Formats: The two formats capture different aspects of knowledge. MCQ accuracy runs into a ceiling effect, while free-response scores continue to separate models, suggesting that the formats measure different dimensions of understanding.
- Impact of Reasoning and Tool Augmentation: Reasoning augmentation (+think) improves the quality of free responses. Tool augmentation (+tool), by contrast, appears to hurt performance, particularly in smaller models.
Conclusion
CAKE provides a way to measure how well large language models understand cloud-native software architecture. Its findings show that the evaluation format shapes what we conclude about a model’s architectural knowledge: MCQs saturate early, while free responses keep distinguishing models as they scale. As the field evolves, benchmarks like CAKE can help guide the development of LLMs so that they remain useful tools for software architects and developers.
