Performance Evaluation of LLMs in Automated RDF Knowledge Graph Generation
Summary: arXiv:2603.29878v1
Abstract: Cloud systems generate large, heterogeneous log data containing critical infrastructure, application, and security information. Transforming these logs into RDF triples enables their integration into knowledge graphs, improving interpretability, root-cause analysis, and cross-service reasoning beyond what raw logs allow. Large Language Models (LLMs) offer a promising approach to automate RDF knowledge graph generation; however, their effectiveness on complex cloud logs remains largely unexplored.
Introduction
This article evaluates multiple LLM architectures and prompting strategies for automated RDF extraction, using a controlled framework with two pipelines for systematically processing semi-structured log data: an extraction pipeline and an evaluation pipeline. The extraction pipeline integrates multiple LLMs to identify relevant entities and relationships and automatically generates subject-predicate-object triples.
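To make the target output concrete, the following minimal sketch shows what subject-predicate-object triples for a single log line might look like. It is not the paper's pipeline: a regular expression stands in for the LLM, and the log line layout, the `EX` namespace, the predicate names, and both helper functions are hypothetical.

```python
import re

# Hypothetical namespace; the paper's ontology is not reproduced here.
EX = "http://example.org/log#"

# Assumed log layout: "<date> <time> <pid> <LEVEL> <component> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<pid>\d+) (?P<level>[A-Z]+) "
    r"(?P<component>\S+) (?P<message>.*)$"
)


def log_line_to_triples(line, event_id):
    """Return (subject, predicate, object) tuples for one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # line does not follow the assumed layout
    subject = f"<{EX}{event_id}>"
    fields = match.groupdict()
    triples = []
    for name in ("level", "component", "message"):
        # Escape backslashes and quotes so the literal stays well formed.
        value = fields[name].replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, f"<{EX}{name}>", f'"{value}"'))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples: one 'S P O .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


sample = "2017-05-16 00:00:04.500 2931 INFO nova.compute.manager Instance spawned"
print(to_ntriples(log_line_to_triples(sample, "event1")))
```

In the study, an LLM performs the entity and relationship identification that the regular expression fakes here; the sketch only illustrates the shape of the resulting triples.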
Methodology
The study used an extraction pipeline that combines several LLMs to process log data. It involved the following steps:
- Creation of a reference Log-to-KG dataset from OpenStack logs using manual annotation and ontology-driven methods.
- Implementation of an evaluation pipeline to assess the generated RDF triples using both syntactic and semantic metrics.
- Testing of multiple LLM architectures with different prompting strategies, including Few-Shot, One-Shot, Zero-Shot, and advanced techniques like Tree-of-Thought.
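The Zero-Shot, One-Shot, and Few-Shot strategies listed above differ mainly in how many worked examples are placed in the prompt. A minimal sketch of that idea follows; the instruction wording, the example log lines and triples, and the `build_prompt` helper are hypothetical and not taken from the paper.

```python
# Hypothetical instruction text; the paper's actual prompts are not shown here.
INSTRUCTION = (
    "Convert the log line into RDF triples in N-Triples syntax. "
    "Return only the triples."
)

# Hypothetical annotated demonstrations: (log line, expected triples).
EXAMPLES = [
    (
        "INFO nova.compute.manager Instance spawned",
        '<http://example.org/log#e1> <http://example.org/log#level> "INFO" .',
    ),
    (
        "ERROR nova.scheduler No valid host found",
        '<http://example.org/log#e2> <http://example.org/log#level> "ERROR" .',
    ),
    (
        "WARNING nova.virt.libvirt Disk usage high",
        '<http://example.org/log#e3> <http://example.org/log#level> "WARNING" .',
    ),
]


def build_prompt(log_line, num_examples=0):
    """Build a prompt with num_examples demonstrations.

    0 gives Zero-Shot, 1 gives One-Shot, 2 or more gives Few-Shot.
    """
    if not 0 <= num_examples <= len(EXAMPLES):
        raise ValueError("num_examples is out of range")
    parts = [INSTRUCTION]
    for example_log, example_triples in EXAMPLES[:num_examples]:
        parts.append(f"Log: {example_log}\nTriples:\n{example_triples}")
    # The final block is left open for the model to complete.
    parts.append(f"Log: {log_line}\nTriples:")
    return "\n\n".join(parts)


query = "INFO nova.api Request completed"
zero_shot = build_prompt(query)
one_shot = build_prompt(query, num_examples=1)
few_shot = build_prompt(query, num_examples=3)
print(few_shot)
```

Keeping the instruction fixed and varying only the number of demonstrations makes the strategies directly comparable, which is the kind of controlled comparison the study describes.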
Results
Few-Shot prompting was the most effective strategy. Results by model and strategy:
- Llama: Achieved a 99.35% F1 score and 100% valid RDF output.
- Qwen, NuExtract, and Gemma: Also performed well under Few-Shot prompting.
- Chain-of-Thought approaches: Maintained accuracy similar to that of Few-Shot methods.
- One-Shot prompting: Provided a lighter but effective alternative for RDF extraction.
- Zero-Shot and advanced strategies (such as Tree-of-Thought, Self-Critique, and Generate-Multiple): Performed substantially worse.
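For context on the F1 and valid-output figures above, the sketch below shows one common way such numbers can be computed. It assumes exact-match comparison of triples and a deliberately simplified N-Triples line check; the paper's actual syntactic and semantic metrics may differ, and all names here are hypothetical.

```python
import re

# Simplified pattern for one N-Triples statement: IRI subject, IRI predicate,
# and an IRI or plain string literal as object. This is an assumption for
# illustration; real RDF validation needs a full parser.
NT_LINE = re.compile(
    r'^<[^<>\s]+> <[^<>\s]+> (<[^<>\s]+>|"(?:[^"\\]|\\.)*") \.$'
)


def is_valid_output(text):
    """True if the output is non-empty and every line matches NT_LINE."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and all(NT_LINE.match(line) for line in lines)


def precision_recall_f1(predicted, reference):
    """Exact-match precision, recall, and F1 over two collections of triples."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def valid_output_rate(outputs):
    """Fraction of model outputs that pass the simplified syntax check."""
    if not outputs:
        return 0.0
    return sum(is_valid_output(output) for output in outputs) / len(outputs)


reference = {("e1", "level", "INFO"), ("e1", "component", "nova.api")}
predicted = {("e1", "level", "INFO"), ("e1", "component", "nova.compute")}
print(precision_recall_f1(predicted, reference))
```

Exact matching is strict: a triple that is semantically right but worded differently counts as an error, which is one reason an evaluation may pair syntactic checks with semantic metrics.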
Discussion
The results highlight the significance of contextual examples and prompt design in achieving accurate RDF extraction. The analysis also revealed model-specific limitations across different LLM architectures, suggesting that while some models excel in Few-Shot scenarios, they may not perform equally well in other prompting contexts.
Conclusion
This study underscores the potential of LLMs for automating RDF knowledge graph generation from cloud logs. By leveraging Few-Shot prompting and thorough evaluation frameworks, researchers can enhance the integration of cloud log data into knowledge graphs, thereby improving interpretability and analytical capabilities. Future work should focus on refining prompting strategies and expanding the dataset to further assess LLM performance across diverse log types.
