KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Researchers have introduced the first version of KWBench (Knowledge Work Bench), a benchmark designed to evaluate unprompted problem recognition in large language models (LLMs). It targets a gap in existing benchmarks, which predominantly measure task completion and extraction against a given specification.
KWBench asks whether an LLM can recognize what kind of professional situation it is facing before attempting to solve it. Traditional benchmarks have become saturated, and they often reduce knowledge work evaluation to simple extraction or completion tasks. In contrast, KWBench tests whether a model recognizes the underlying structure of a situation from raw inputs alone.
Key Features of KWBench
- Comprehensive Task Collection: The benchmark comprises 223 tasks sourced from various professional domains, including acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design.
- Game-Theoretic Patterns: Each task is structured around formal game-theoretic patterns such as principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, and strategic interdependence.
- Structured Ground Truth: Each task carries a structured ground truth that encapsulates the expert’s understanding of the situation along with the anticipated failure modes.
- Three-Tier Scoring Rubric: Models are evaluated using a three-tier rubric, which includes a mandatory conjunctive check that encodes predicted wrong paths.
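The rubric described in the last bullet can be illustrated with a minimal sketch. Only the three-tier shape and the mandatory conjunctive check come from the description above. The tier names `good_to_have` and `ideal`, the weights, and the `RubricResult` and `score` names are hypothetical and do not reflect KWBench's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RubricResult:
    # Tier 1: mandatory checks, each encoding a predicted wrong path to avoid.
    mandatory: list[bool]
    # Tiers 2 and 3: hypothetical names for the non-gating tiers.
    good_to_have: list[bool] = field(default_factory=list)
    ideal: list[bool] = field(default_factory=list)


def score(result: RubricResult, w_good: float = 0.6, w_ideal: float = 0.4):
    """Return (passed, quality) for one task. Weights are illustrative only."""
    # Conjunctive gate: every mandatory check must hold, so one miss fails the task.
    passed = bool(result.mandatory) and all(result.mandatory)
    if not passed:
        return False, 0.0

    def frac(checks: list[bool]) -> float:
        # An empty tier counts as fully satisfied in this sketch.
        return sum(checks) / len(checks) if checks else 1.0

    quality = w_good * frac(result.good_to_have) + w_ideal * frac(result.ideal)
    return True, quality


passed, quality = score(
    RubricResult(mandatory=[True, True], good_to_have=[True, False], ideal=[True])
)
```

The design point is that the mandatory tier is a gate, not a weighted term: a response that follows a predicted wrong path fails no matter how well it does on the other tiers.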
Evaluation Methodology
The researchers evaluated 16 models on KWBench. The best-performing model passed only 27.9% of the tasks. The top two models agreed on only 31.7% of their passes, which means that even the strongest models succeed on largely different tasks.
Among the top eight models, 44 tasks were solved by exactly one model. Routing across the top eight models covers 50.7% of the benchmark tasks, nearly double the pass rate of the best single model.
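The article does not say how the agreement rate is computed. One plausible reading, assumed here for illustration, is the Jaccard overlap of the two models' pass sets: tasks both pass divided by tasks at least one passes. The function name and the toy task IDs below are hypothetical.

```python
def pass_overlap(passes_a: set[str], passes_b: set[str]) -> float:
    """Jaccard overlap of two pass sets (an assumed definition of 'agreement')."""
    union = passes_a | passes_b
    if not union:
        return 0.0
    return len(passes_a & passes_b) / len(union)


# Toy data, not KWBench results.
model_a = {"t1", "t2", "t3", "t4"}
model_b = {"t3", "t4", "t5"}
overlap = pass_overlap(model_a, model_b)  # 2 shared tasks out of 5 in the union
```

Under this reading, a low overlap between two models with similar pass rates indicates that they succeed on different tasks rather than on the same easy subset.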
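The coverage and exactly-one-solver figures can be derived from per-model pass sets. The sketch below assumes an oracle router that always picks a model that passes the task, so it gives an upper bound on what a real router could achieve. The function name, the dictionary layout, and the toy data are hypothetical.

```python
from collections import Counter


def routing_stats(passes: dict[str, set[str]], n_tasks: int) -> dict[str, float]:
    """Summarize oracle-routing coverage over a set of models."""
    # For each task, count how many models pass it.
    solve_counts = Counter(task for solved in passes.values() for task in solved)
    best_single = max((len(solved) for solved in passes.values()), default=0)
    return {
        # Fraction of tasks passed by at least one model.
        "coverage": len(solve_counts) / n_tasks,
        # Pass rate of the single best model.
        "best_single": best_single / n_tasks,
        # Number of tasks passed by exactly one model.
        "solved_by_exactly_one": sum(1 for c in solve_counts.values() if c == 1),
    }


# Toy data, not KWBench results: three models, eight tasks.
toy = {"m1": {"t1", "t2"}, "m2": {"t2", "t3"}, "m3": {"t4"}}
stats = routing_stats(toy, n_tasks=8)
```

In the toy data, coverage is twice the best single pass rate because most solved tasks have exactly one solver, which is the same pattern the reported figures show.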
Insights and Conclusions
Conditional on passing, the quality scores across models converged at approximately 83%. Unconditional scores, however, differed substantially across models. The same models that could articulate the relevant game-theoretic concepts when prompted frequently failed to apply them in an unprompted context.
KWBench shifts the evaluation of frontier models in knowledge work toward recognizing the correct problem from the situation at hand, rather than only executing once a problem has been framed. This approach could lead to more effective applications of large language models in professional settings.
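The gap between conditional and unconditional scores follows from simple arithmetic if failed tasks contribute a quality of zero. That zero-score treatment is an assumption made here, not something the article states. A minimal sketch with hypothetical names and toy data:

```python
def quality_summary(results: list[tuple[bool, float]]) -> tuple[float, float, float]:
    """Return (pass_rate, conditional_quality, unconditional_quality).

    Each result is (passed, quality). Failed tasks are assumed to score 0.
    """
    if not results:
        return 0.0, 0.0, 0.0
    passed_scores = [quality for passed, quality in results if passed]
    pass_rate = len(passed_scores) / len(results)
    # Mean quality over passed tasks only.
    conditional = sum(passed_scores) / len(passed_scores) if passed_scores else 0.0
    # Mean quality over all tasks, with failures counted as zero.
    unconditional = sum(passed_scores) / len(results)
    return pass_rate, conditional, unconditional


# Toy data, not KWBench results: equal quality when passing, different pass rates.
stronger = [(True, 0.8), (True, 0.9), (False, 0.0), (False, 0.0)]
weaker = [(True, 0.85), (False, 0.0), (False, 0.0), (False, 0.0)]
strong_summary = quality_summary(stronger)
weak_summary = quality_summary(weaker)
```

Under this assumption the unconditional score equals the pass rate times the conditional quality, so models with the same conditional quality still separate according to how often they recognize the problem at all.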
