Querying Structured Data Through Natural Language Using Language Models
arXiv:2604.03057v1
Abstract
This paper presents an open-source methodology that lets users query structured, non-textual datasets in natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains a large language model (LLM) to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation that produces diverse question-answer pairs capturing both user intent and the semantics of the underlying dataset.
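As a rough illustration of how such a pipeline could look, the sketch below fills question templates with values drawn from the data and keeps a pair only when its query actually runs and returns rows. Everything specific here is an assumption made for the example: the source does not name the query language or schema, so SQL, the `services` table, its columns, the town names, the numbers, and the templates are all hypothetical.

```python
import sqlite3

# Made-up toy rows for illustration only; not values from the paper's dataset.
ROWS = [
    ("TownA", "pharmacy", 4.0),
    ("TownA", "hospital", 12.5),
    ("TownB", "pharmacy", 6.0),
    ("TownB", "hospital", 21.0),
]

# Each hypothetical template couples a question pattern with a query pattern.
TEMPLATES = [
    (
        "How many minutes does it take to reach the nearest {service} from {town}?",
        "SELECT minutes FROM services WHERE town = '{town}' AND service = '{service}'",
    ),
    (
        "Which town is closest to a {service}?",
        "SELECT town FROM services WHERE service = '{service}' ORDER BY minutes LIMIT 1",
    ),
]


def build_db():
    """Load the toy rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE services (town TEXT, service TEXT, minutes REAL)")
    conn.executemany("INSERT INTO services VALUES (?, ?, ?)", ROWS)
    return conn


def generate_pairs(conn):
    """Fill each template with values from the data; keep pairs whose query returns rows."""
    towns = sorted({row[0] for row in ROWS})
    services = sorted({row[1] for row in ROWS})
    pairs, seen = [], set()
    for question_tpl, query_tpl in TEMPLATES:
        for town in towns:
            for service in services:
                # The query text is the training target, so values are written
                # into it literally; they come from our own table, not user input.
                question = question_tpl.format(town=town, service=service)
                query = query_tpl.format(town=town, service=service)
                if question in seen:  # templates without {town} repeat per town
                    continue
                seen.add(question)
                rows = conn.execute(query).fetchall()
                if rows:
                    pairs.append({"question": question, "query": query, "answer": rows})
    return pairs


conn = build_db()
pairs = generate_pairs(conn)
for pair in pairs:
    print(pair["question"], "->", pair["query"])
```

Executing every candidate query before keeping it is one simple way to ensure the training targets are valid; a fuller pipeline would presumably also paraphrase the questions for diversity.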
Methodology
We fine-tune a compact model, DeepSeek R1 Distill 8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. The aim is a model that reliably interprets a user's question and translates it into a structured query over the dataset.
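A back-of-the-envelope calculation helps show why this combination fits on commodity hardware. The sketch below estimates the raw weight storage at different bit widths and the number of trainable parameters a low-rank adapter adds to a single linear layer. The layer size (4096 x 4096) and rank (16) are illustrative assumptions, not settings reported in the source, and the memory figure ignores quantization constants, activations, and optimizer state.

```python
def weight_bytes(n_params: float, bits_per_param: float) -> float:
    """Raw storage for the weights alone, in bytes."""
    return n_params * bits_per_param / 8


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a rank-`rank` adapter adds to one d_in x d_out
    linear layer: a (rank x d_in) matrix plus a (d_out x rank) matrix."""
    return rank * (d_in + d_out)


N_PARAMS = 8e9  # roughly 8 billion parameters for an "8B" model

gb_16bit = weight_bytes(N_PARAMS, 16) / 1e9
gb_4bit = weight_bytes(N_PARAMS, 4) / 1e9

# Assumed layer shape and rank, chosen only to make the ratio concrete.
full_layer = 4096 * 4096
adapter = lora_params(4096, 4096, rank=16)

print(f"16-bit weights: {gb_16bit:.1f} GB, 4-bit weights: {gb_4bit:.1f} GB")
print(f"adapter params per layer: {adapter} ({adapter / full_layer:.2%} of the full layer)")
```

Under these assumptions the frozen base weights shrink by a factor of four, while the trainable adapter is under one percent of each adapted layer, which is what keeps both memory and training cost low.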
Evaluation
We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across various scenarios, including:
- Monolingual queries
- Multilingual queries
- Unseen location scenarios
These results demonstrate both robust generalization and reliable query generation.
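One common way to score generated queries is execution match: run the predicted query and a reference query and compare what they return. The source does not state which metric it uses, so the sketch below is only an assumption about how such a check could work. SQL, the `services` table, and the example queries are hypothetical, and the toy rows are made up.

```python
import sqlite3


def build_db():
    """Toy in-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE services (town TEXT, service TEXT, minutes REAL)")
    conn.executemany(
        "INSERT INTO services VALUES (?, ?, ?)",
        [
            ("TownA", "pharmacy", 4.0),
            ("TownA", "hospital", 12.5),
            ("TownB", "pharmacy", 6.0),
            ("TownB", "hospital", 21.0),
        ],
    )
    return conn


def run(conn, query):
    """Return the sorted result rows, or None if the query fails to execute."""
    try:
        # Sorting makes the comparison order-insensitive, which is a simplification.
        return sorted(conn.execute(query).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, examples):
    """Fraction of (predicted, reference) pairs whose results are identical."""
    hits = 0
    for predicted, reference in examples:
        predicted_rows = run(conn, predicted)
        if predicted_rows is not None and predicted_rows == run(conn, reference):
            hits += 1
    return hits / len(examples)


examples = [
    # Same result, conditions written in a different order: counts as correct.
    ("SELECT minutes FROM services WHERE service = 'pharmacy' AND town = 'TownA'",
     "SELECT minutes FROM services WHERE town = 'TownA' AND service = 'pharmacy'"),
    # Different threshold, same rows on this data: counts as correct.
    ("SELECT town FROM services WHERE minutes < 10",
     "SELECT town FROM services WHERE minutes <= 6"),
    # Valid query, wrong result: counts as incorrect.
    ("SELECT COUNT(*) FROM services WHERE service = 'hospital'",
     "SELECT COUNT(*) FROM services"),
    # Invalid syntax: counts as incorrect.
    ("SELEC town FROM services",
     "SELECT town FROM services"),
]

accuracy = execution_accuracy(build_db(), examples)
print(f"execution accuracy: {accuracy:.2f}")
```

Comparing results rather than query strings tolerates harmless rewrites, but as the second example shows, it can also credit a query that is only right by coincidence on the test data, so the choice of test rows matters.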
Results
Our results highlight that small domain-specific models can achieve high precision for this task without relying on large proprietary LLMs. This makes our methodology particularly suitable for resource-constrained environments and adaptable to broader multi-dataset systems. The evaluation indicates that the model is both efficient and effective at interpreting natural language questions about structured data.
Conclusion
The findings from our research suggest that leveraging smaller, specialized models can provide an efficient alternative to larger models, especially in contexts where computational resources are limited. As we continue to refine our approach, we anticipate further advancements that will allow for greater adaptability and performance across diverse datasets.
Future Work
Future research will focus on:
- Enhancing the dataset to cover more diverse scenarios and user queries
- Improving the model’s ability to handle complex queries involving multiple parameters
- Exploring the potential of integrating additional machine learning techniques to boost performance
Ultimately, our goal is to democratize access to structured data through natural language, for a wide range of users and applications.
