Natural Language Querying of Structured Data Using LLMs

Querying Structured Data Through Natural Language Using Language Models

Source: arXiv:2604.03057v1 (announce type: cross)

Abstract

This paper presents an open-source methodology that lets users query structured, non-textual datasets in natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains a large language model (LLM) to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation that produces diverse question-answer pairs capturing both user intent and the semantics of the underlying dataset.
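To make the idea of synthetic training data more concrete, here is a minimal sketch of one possible way to produce question/query pairs from templates. It is not the paper's pipeline, which this summary does not describe in detail. The table name `accessibility`, its columns, the templates, and the sample values are all hypothetical, and the sketch assumes the target the model learns for each question is the executable query.

```python
import itertools
import json

# Hypothetical slot values; the real dataset's columns and values are not
# listed in this article.
MUNICIPALITIES = ["Durango", "Elorrio"]
SERVICES = ["pharmacy", "school"]

# Hypothetical query pattern shared by both phrasings below.
QUERY_PATTERN = (
    "SELECT travel_minutes FROM accessibility "
    "WHERE municipality = '{municipality}' AND service = '{service}';"
)

# Each template pairs a natural-language phrasing with a query pattern, so
# several phrasings of the same intent map to the same query.
TEMPLATES = [
    ("How long does it take to reach the nearest {service} in {municipality}?",
     QUERY_PATTERN),
    ("What is the travel time to a {service} from {municipality}?",
     QUERY_PATTERN),
]


def generate_pairs():
    """Yield one example per (template, municipality, service) combination."""
    for (question, query), municipality, service in itertools.product(
        TEMPLATES, MUNICIPALITIES, SERVICES
    ):
        slots = {"municipality": municipality, "service": service}
        yield {
            "question": question.format(**slots),
            "query": query.format(**slots),
        }


pairs = list(generate_pairs())
print(len(pairs))
print(json.dumps(pairs[0], indent=2))
```

A real pipeline would likely need far more varied phrasings (and, for multilingual support, phrasings in several languages) than two fixed templates can provide.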

Methodology

We fine-tune a compact model, DeepSeek R1 Distill 8B, using QLoRA with 4-bit quantization, which makes the system suitable for deployment on commodity hardware. The fine-tuned model interprets a user's question and translates it into a structured database query.
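A rough back-of-the-envelope calculation shows why 4-bit quantization matters for commodity hardware. The sketch below estimates the memory needed for the weights alone, assuming a nominal 8 billion parameters. It deliberately ignores quantization metadata, the LoRA adapter weights, optimizer state, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal "8B" parameter count (an assumption; the exact figure differs slightly).
N_PARAMS = 8e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB to about 4 GB, which is the difference between needing a data-center GPU and fitting on a single consumer card.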

Evaluation

We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across various scenarios, including:

Monolingual queries
Multilingual queries
Unseen location scenarios

These results indicate that the model generalizes robustly and generates queries reliably.
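This summary does not state which accuracy metric the paper uses. One common choice for query-generation systems is execution accuracy: a predicted query counts as correct if it runs and returns the same rows as the reference query. The sketch below computes it per scenario on a toy SQLite table. The schema, the placeholder municipalities "A" and "B", and the three hand-written predictions are all made up to exercise the metric; they are not model outputs or results from the paper.

```python
import sqlite3
from collections import defaultdict

# Toy in-memory table with a hypothetical schema and placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accessibility (municipality TEXT, service TEXT, travel_minutes REAL)"
)
conn.executemany(
    "INSERT INTO accessibility VALUES (?, ?, ?)",
    [("A", "pharmacy", 5.0), ("B", "pharmacy", 12.0)],
)


def run(query):
    """Return the sorted result rows, or None if the query fails to execute."""
    try:
        return sorted(conn.execute(query).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(examples):
    """Per scenario, the fraction of predicted queries matching the gold result."""
    hits, totals = defaultdict(int), defaultdict(int)
    for scenario, predicted, gold in examples:
        totals[scenario] += 1
        result = run(predicted)
        if result is not None and result == run(gold):
            hits[scenario] += 1
    return {scenario: hits[scenario] / totals[scenario] for scenario in totals}


GOLD = "SELECT travel_minutes FROM accessibility WHERE municipality = 'A' AND service = 'pharmacy'"

# (scenario, predicted query, gold query); predictions are hand-written toys.
examples = [
    # Different SQL text, same rows: counted as correct.
    ("monolingual", "SELECT travel_minutes FROM accessibility WHERE municipality = 'A'", GOLD),
    # Runs, but returns the wrong municipality's row: counted as incorrect.
    ("multilingual", "SELECT travel_minutes FROM accessibility WHERE municipality = 'B'", GOLD),
    # Does not parse: counted as incorrect.
    ("unseen_location", "SELEC travel_minutes FROM accessibility", GOLD),
]

scores = execution_accuracy(examples)
print(scores)
```

Comparing results rather than query text avoids penalizing a prediction that is worded differently from the reference but returns the same answer.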

Results

Our results highlight that small domain-specific models can achieve high precision for this task without relying on large proprietary LLMs. This makes our methodology particularly suitable for resource-constrained environments and adaptable to broader multi-dataset systems. The evaluation indicates that the model is both efficient and effective at interpreting natural language questions about structured data.

Conclusion

The findings from our research suggest that leveraging smaller, specialized models can provide an efficient alternative to larger models, especially in contexts where computational resources are limited. As we continue to refine our approach, we anticipate further advancements that will allow for greater adaptability and performance across diverse datasets.

Future Work

Future research will focus on:

Enhancing the dataset to cover more diverse scenarios and user queries
Improving the model’s ability to handle complex queries involving multiple parameters
Exploring the potential of integrating additional machine learning techniques to boost performance

Ultimately, our goal is to democratize access to structured data through natural language, making it usable by a wide range of users and applications.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
