Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation
arXiv:2604.07321v1 (cross-listed)
Abstract
Propositional Linear Temporal Logic (LTL) is a popular formalism for specifying requirements, as well as security and privacy policies, for software, networks, and systems. Yet expressing such requirements and policies in LTL remains challenging because of its intricate semantics. Since many security and privacy analysis tools require LTL formulas as input, this difficulty places them out of reach for many developers and analysts. Large Language Models (LLMs) could broaden access to such tools by translating natural language fragments into LTL formulas.
Introduction
This paper evaluates that premise by assessing how effectively several representative LLMs translate assertive English sentences into LTL formulas. The evaluation employs both human-generated and synthetic ground-truth data, focusing on the effectiveness of the translations along syntactic and semantic dimensions.
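To make the translation task concrete, here is an illustrative input and two candidate outputs. The sentence and the proposition names alarm and ack are assumptions chosen for illustration and are not taken from the paper's data. For the sentence "The alarm stays on until it is acknowledged", both formulas below are syntactically valid, yet they differ in meaning: the strong until (U) additionally requires that the acknowledgement eventually happens, while the weak until (W) also accepts runs where the alarm stays on forever.

```latex
\begin{align*}
\varphi_{\text{strong}} &= \mathit{alarm} \;\mathbf{U}\; \mathit{ack} \\
\varphi_{\text{weak}}   &= \mathit{alarm} \;\mathbf{W}\; \mathit{ack}
  \;\equiv\; (\mathit{alarm} \;\mathbf{U}\; \mathit{ack}) \,\lor\, \mathbf{G}\,\mathit{alarm}
\end{align*}
```

Which reading is correct depends on the intent behind the English sentence, which is one way a well-formed translation can still be semantically wrong.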
Key Findings
The results reveal three main findings:
- Syntactic vs. Semantic Performance: In line with prior findings, LLMs tend to perform better on syntactic aspects of LTL than on semantic ones.
- Impact of Prompts: LLMs generally benefit from more detailed prompts, which help improve the quality of the translations.
- Task Reformulation: Reformulating the task as a Python code-completion problem substantially improves overall performance in translating natural language to LTL.
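The summary does not show the prompt the authors used, so the following is only a minimal sketch of what a Python code-completion framing could look like. The constructor names (Atom, Always, Eventually, and so on), the prompt layout, and the helper functions build_completion_prompt and parse_completion are all hypothetical. The idea is that the model is asked to finish an assignment statement rather than to emit a free-form formula, and the completion is then parsed against a whitelist of constructors.

```python
import ast
from dataclasses import dataclass

# Hypothetical constructor names; the paper's actual setup is not given here.
@dataclass(frozen=True)
class Atom:
    name: str

@dataclass(frozen=True)
class Not:
    arg: object

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Implies:
    left: object
    right: object

@dataclass(frozen=True)
class Eventually:
    arg: object

@dataclass(frozen=True)
class Always:
    arg: object

@dataclass(frozen=True)
class Until:
    left: object
    right: object

CONSTRUCTORS = {cls.__name__: cls
                for cls in (Atom, Not, And, Implies, Eventually, Always, Until)}

def build_completion_prompt(sentence, atoms):
    """Return Python source that stops right after 'formula = '."""
    return (
        "# LTL constructors: Atom(name), Not(f), And(f, g), Implies(f, g),\n"
        "# Eventually(f), Always(f), Until(f, g)\n"
        f"# Requirement: {sentence}\n"
        f"# Atomic propositions: {', '.join(atoms)}\n"
        "formula = "
    )

def parse_completion(text):
    """Build a formula from completion text, allowing only known constructors."""
    return _build(ast.parse(text.strip(), mode="eval").body)

def _build(node):
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in CONSTRUCTORS and not node.keywords):
        raise ValueError("completion is not a call to a known LTL constructor")
    cls = CONSTRUCTORS[node.func.id]
    if cls is Atom:
        ok = (len(node.args) == 1 and isinstance(node.args[0], ast.Constant)
              and isinstance(node.args[0].value, str))
        if not ok:
            raise ValueError("Atom expects a single string literal")
        return Atom(node.args[0].value)
    try:
        return cls(*[_build(arg) for arg in node.args])
    except TypeError as exc:  # wrong number of arguments for this constructor
        raise ValueError(str(exc)) from exc

prompt = build_completion_prompt("Every request is eventually granted.",
                                 ["req", "grant"])
# A completion the model might return for the prompt above (made up here).
formula = parse_completion(
    'Always(Implies(Atom("req"), Eventually(Atom("grant"))))')
```

Walking the parsed syntax tree instead of calling eval keeps arbitrary model output from being executed. A completion that fails to parse or uses an unknown name is rejected, which gives a cheap syntactic check before any semantic comparison.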
Discussion
Despite these positive findings, the study underscores how difficult it is to evaluate LLMs fairly in this context. Because of the intricacies of LTL semantics, translation quality is not easily quantified: for instance, two formulas can differ textually yet be logically equivalent, so a string-level comparison against the ground truth can penalize a correct translation. Evaluation criteria must therefore be chosen carefully so that both syntactic and semantic aspects are adequately measured.
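The gap between syntactic and semantic measures can be sketched in a few lines. The code below is not the paper's evaluation method. It is a simplified illustration that represents formulas as nested tuples (an encoding assumed here), checks well-formedness, and then searches small lasso-shaped traces (a finite prefix followed by a loop repeated forever) for one on which two formulas disagree. Finding such a trace proves the formulas are not equivalent. Finding none within the bound is only evidence of equivalence, not a proof, and a full check would need a dedicated LTL tool.

```python
from itertools import product

# Operator -> number of arguments, for the tuple encoding assumed here.
ARITY = {"ap": 1, "not": 1, "X": 1, "F": 1, "G": 1,
         "and": 2, "or": 2, "->": 2, "U": 2}

def is_well_formed(f):
    """Syntactic check: known operators with the right number of arguments."""
    if not (isinstance(f, tuple) and f and f[0] in ARITY
            and len(f) == ARITY[f[0]] + 1):
        return False
    if f[0] == "ap":
        return isinstance(f[1], str)
    return all(is_well_formed(sub) for sub in f[1:])

def holds(f, prefix, loop, i=0):
    """Evaluate f at position i of the infinite trace prefix + loop^omega."""
    states = prefix + loop
    n = len(states)

    def succ(k):  # the last position jumps back to the start of the loop
        return k + 1 if k + 1 < n else len(prefix)

    def reachable(k):  # n steps cover every position reachable from k
        for _ in range(n):
            yield k
            k = succ(k)

    def h(g, k):
        return holds(g, prefix, loop, k)

    op = f[0]
    if op == "ap":
        return f[1] in states[i]
    if op == "not":
        return not h(f[1], i)
    if op == "and":
        return h(f[1], i) and h(f[2], i)
    if op == "or":
        return h(f[1], i) or h(f[2], i)
    if op == "->":
        return (not h(f[1], i)) or h(f[2], i)
    if op == "X":
        return h(f[1], succ(i))
    if op == "F":
        return any(h(f[1], k) for k in reachable(i))
    if op == "G":
        return all(h(f[1], k) for k in reachable(i))
    if op == "U":
        for k in reachable(i):
            if h(f[2], k):
                return True
            if not h(f[1], k):
                return False
        return False
    raise ValueError(f"unknown operator: {op!r}")

def find_distinguishing_trace(f, g, atoms, max_prefix=2, max_loop=2):
    """Return a (prefix, loop) pair on which f and g disagree, or None."""
    states = [frozenset(a for a, bit in zip(atoms, bits) if bit)
              for bits in product([False, True], repeat=len(atoms))]
    for p_len in range(max_prefix + 1):
        for l_len in range(1, max_loop + 1):
            for prefix in product(states, repeat=p_len):
                for loop in product(states, repeat=l_len):
                    if holds(f, prefix, loop) != holds(g, prefix, loop):
                        return prefix, loop
    return None

req, grant = ("ap", "req"), ("ap", "grant")
reference = ("G", ("->", req, ("F", grant)))        # G(req -> F grant)
rewritten = ("G", ("or", ("not", req), ("F", grant)))  # same meaning, other text
wrong = ("G", ("->", req, ("X", grant)))            # well-formed, other meaning

same = find_distinguishing_trace(reference, rewritten, ["req", "grant"])
diff = find_distinguishing_trace(reference, wrong, ["req", "grant"])
```

In this example all three formulas pass the syntactic check. An exact-match metric would reject rewritten even though no trace within the bound separates it from reference, while wrong would only be caught by the semantic search, which returns a trace on which it disagrees with reference.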
Recommendations for Future Work
To advance the field, the authors make several recommendations:
- Enhance the training datasets used for LLMs to include a wider variety of natural language expressions that correspond to LTL formulas.
- Incorporate a more diverse set of prompts to test the adaptability of LLMs in translating different forms of requirements and policies.
- Explore alternative methods for evaluating semantic accuracy, beyond traditional metrics, to better capture the nuances of LTL translations.
- Encourage collaboration between linguists, logicians, and AI researchers to develop more refined evaluation frameworks.
Conclusion
This paper highlights the potential of Large Language Models to democratize access to LTL-based security and privacy analysis tools. While the findings indicate promising avenues for improvement, they also stress the need for ongoing research to address the complexities inherent in both the syntax and the semantics of LTL.
