Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
Summary: arXiv:2604.02709v1
The formal reasoning capabilities of large language models (LLMs) are essential for advancing automated software engineering. Yet, existing benchmarks for LLMs lack systematic evaluation grounded in computation and complexity. This gap leaves a critical question unanswered: can state-of-the-art (SOTA) LLMs comprehend the structured, hierarchical complexity of formal languages as defined by Computation Theory?
Introduction to ChomskyBench
To address this question, the authors introduce a new benchmark called ChomskyBench. It is designed to systematically assess LLMs using the framework of the Chomsky Hierarchy, which categorizes formal languages into levels of complexity. Earlier approaches mostly evaluated neural networks through vectorized classification. According to the authors, ChomskyBench is the first benchmark to combine full coverage of the Chomsky Hierarchy, process-trace evaluation through natural language, and deterministic symbolic verifiability.
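To make "deterministic symbolic verifiability" concrete, here is a minimal sketch of how a claimed reasoning trace could be checked mechanically for a regular-language task. It is not taken from ChomskyBench: the automaton, the trace format, and the name `verify_trace` are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical DFA over {a, b} that accepts strings with an even number of 'a's.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
START = "even"
ACCEPTING: Set[str] = {"even"}


def verify_trace(word: str, claimed_states: List[str], claimed_accept: bool) -> bool:
    """Return True only if the claimed state sequence and verdict match the DFA."""
    # A valid trace lists the start state plus one state per input symbol.
    if len(claimed_states) != len(word) + 1 or claimed_states[0] != START:
        return False
    # Every step of the trace must follow the transition table exactly.
    for i, symbol in enumerate(word):
        expected = TRANSITIONS.get((claimed_states[i], symbol))
        if expected is None or expected != claimed_states[i + 1]:
            return False
    # The final verdict must agree with the state the trace ends in.
    return claimed_accept == (claimed_states[-1] in ACCEPTING)
```

Because the check is a plain table lookup per step, the same trace always receives the same verdict, with no judgment call by a human or a second model.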
Structure of ChomskyBench
ChomskyBench comprises a comprehensive suite of language recognition and generation tasks. These tasks are designed to test the capabilities of LLMs at each level of the Chomsky Hierarchy, listed here from the most general class to the most restrictive:
- Type 0: Recursively enumerable languages
- Type 1: Context-sensitive languages
- Type 2: Context-free languages
- Type 3: Regular languages
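For a concrete feel of the levels above, the following sketch gives one textbook example language per decidable level, each with a small recognizer. These languages and function names are illustrative assumptions, not tasks taken from ChomskyBench. Type 0 is omitted because membership in a recursively enumerable language is only semi-decidable in general, so no always-terminating recognizer exists for the class as a whole.

```python
import re


def is_regular_example(s: str) -> bool:
    """Type 3 example: the regular language a*b*."""
    return re.fullmatch(r"a*b*", s) is not None


def is_context_free_example(s: str) -> bool:
    """Type 2 example: { a^n b^n : n >= 0 }, context-free but not regular."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


def is_context_sensitive_example(s: str) -> bool:
    """Type 1 example: { a^n b^n c^n : n >= 1 }, context-sensitive but not context-free."""
    n = len(s) // 3
    return n >= 1 and len(s) == 3 * n and s == "a" * n + "b" * n + "c" * n


if __name__ == "__main__":
    for word in ["aabb", "aabbcc", "aba"]:
        print(
            word,
            is_regular_example(word),
            is_context_free_example(word),
            is_context_sensitive_example(word),
        )
```

Each language here needs strictly more machinery than the one before it: a finite automaton suffices for `a*b*`, one counter or stack for `a^n b^n`, and more than a single stack for `a^n b^n c^n`.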
Findings from Experiments
Extensive experiments using ChomskyBench reveal a clear performance stratification that aligns with the complexity levels defined by the hierarchy. The analysis indicates that increasing task difficulty significantly affects both inference length and overall performance. Key findings include:
- Larger models and advanced inference techniques yield notable relative performance improvements.
- However, these models encounter severe efficiency barriers; achieving reliable results necessitates prohibitively high computational costs.
- The limitations observed are primarily due to inefficiencies rather than absolute capability constraints.
Implications for Future Development
A time complexity analysis further illustrates that LLMs are substantially less efficient than traditional algorithmic programs when tasked with formal reasoning. These results not only outline the practical limitations of current LLMs but also underscore the ongoing necessity for conventional software tools in formal reasoning tasks.
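As a point of comparison for what a "traditional algorithmic program" looks like, the sketch below implements the classic CYK algorithm, which decides membership for a context-free grammar in Chomsky normal form in time cubic in the input length. The source does not say which algorithms the authors used as baselines, so the choice of CYK, the grammar, and the name `cyk_accepts` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical grammar in Chomsky normal form for { a^n b^n : n >= 1 }:
#   S -> A B | A T,   T -> S B,   A -> 'a',   B -> 'b'
TERMINAL_RULES: Dict[str, Set[str]] = {"a": {"A"}, "b": {"B"}}
BINARY_RULES: Dict[Tuple[str, str], Set[str]] = {
    ("A", "B"): {"S"},
    ("A", "T"): {"S"},
    ("S", "B"): {"T"},
}


def cyk_accepts(word: str, start: str = "S") -> bool:
    """Decide membership with CYK; runs in O(n^3) table steps for this fixed grammar."""
    n = len(word)
    if n == 0:
        return False  # this grammar does not derive the empty string
    # table[i][l] holds the nonterminals that derive word[i : i + l + 1].
    table: List[List[Set[str]]] = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(TERMINAL_RULES.get(ch, set()))
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # substring start
            for split in range(1, length):  # size of the left part
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for x in left:
                    for y in right:
                        table[i][length - 1] |= BINARY_RULES.get((x, y), set())
    return start in table[0][n - 1]
```

The three nested loops make the cost predictable and bounded in advance, which is the property the comparison with LLM inference turns on.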
Moreover, the insights gained from ChomskyBench can serve as a guiding framework for the development of future LLMs possessing enhanced formal reasoning capabilities. As the field evolves, understanding the boundaries and potentials of LLMs within the context of Computation Theory will be crucial in shaping the next generation of artificial intelligence.
Conclusion
While LLMs have made remarkable strides in natural language processing, their formal reasoning capabilities require more rigorous evaluation. Tools like ChomskyBench help bridge this gap by offering a structured approach to assessing and understanding the complexities inherent in formal languages.
