JTON: A Token-Efficient JSON Superset with Zen Grid Tabular Encoding for Large Language Models
arXiv:2604.05865v1
Abstract: When LLMs process structured data, the serialization format directly affects cost and context utilization. Standard JSON wastes tokens repeating key names in every row of a tabular array—overhead that scales linearly with row count. This paper presents JTON (JSON Tabular Object Notation), a strict JSON superset whose main idea, Zen Grid, factors column headers into a single row and encodes values with semicolons, preserving JSON’s type system while cutting redundancy.
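The header-factoring idea can be illustrated with a short Python sketch. The layout here (one semicolon-separated header line, then one semicolon-separated line per row) is an assumption made for illustration, not the actual JTON grammar, and `to_grid` is a hypothetical helper name. Values are written with `json.dumps`, so JSON's distinctions between numbers, strings, booleans, and null are kept.

```python
import json

def to_grid(rows):
    """Encode same-keyed dicts as one header line plus one line per row.

    Hypothetical layout for illustration only; real JTON syntax may differ.
    """
    if not rows:
        return ""
    columns = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != columns:
            raise ValueError("all rows must share the same keys in the same order")
    lines = [";".join(columns)]  # key names appear once, in the header
    for row in rows:
        # json.dumps keeps JSON typing: 1 vs "1", true/false, null.
        lines.append(";".join(json.dumps(row[c]) for c in columns))
    return "\n".join(lines)

rows = [
    {"id": 1, "name": "Ada", "active": True},
    {"id": 2, "name": "Lin", "active": False},
    {"id": 3, "name": "Sam", "active": None},
]

as_json = json.dumps(rows, separators=(",", ":"))  # compact standard JSON
as_grid = to_grid(rows)

print(as_grid)
print(len(as_json), "characters as JSON vs", len(as_grid), "as grid")
```

Character counts are only a rough proxy for token counts, which depend on the tokenizer. The sketch also does not escape semicolons inside string values, which a real format would have to handle.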
Key Features of JTON
JTON is aimed at making structured-data processing in large language models (LLMs) more efficient. The paper reports the following:
- Token Efficiency: JTON reduces token counts by 15-60% compared to standard JSON formats, with an average reduction of 28.5% and up to 32% when using bare strings.
- Comprehension Tests: Tests conducted on 10 LLMs reveal a net accuracy gain of +0.3 percentage points over traditional JSON formats. While four models demonstrated improvement, three maintained their performance, and three experienced slight declines.
- Generation Validity: Generation tests involving 12 LLMs confirmed 100% syntactic validity in both few-shot and zero-shot settings, showcasing JTON’s reliability in generating structured data.
- Rust/PyO3 Implementation: A reference implementation in Rust with PyO3 provides SIMD-accelerated parsing, running 1.4 times faster than Python’s built-in json module.
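A parsing-speed comparison like the one in the last bullet could be measured with a small timing harness. The sketch below is an assumption about methodology, not the authors' benchmark. Because the Python API of the JTON reference implementation is not described here, the candidate parser is passed in as a plain callable; `best_time` and `speedup` are hypothetical helper names. The demo compares `json.loads` against itself, so the ratio should land near 1.

```python
import json
import timeit

def best_time(parse, payload, number=50, repeat=5):
    """Best-of-`repeat` wall time for `number` calls of parse(payload)."""
    return min(timeit.repeat(lambda: parse(payload), number=number, repeat=repeat))

def speedup(baseline, candidate, payload, **kwargs):
    """How many times faster `candidate` is than `baseline` on `payload`."""
    return best_time(baseline, payload, **kwargs) / best_time(candidate, payload, **kwargs)

# Synthetic tabular payload: 1000 rows with identical keys.
payload = json.dumps(
    [{"id": i, "name": f"user{i}", "active": i % 2 == 0} for i in range(1000)]
)

# Placeholder comparison: swap the second argument for another parser's
# loads-style function to measure it against the built-in json module.
ratio = speedup(json.loads, json.loads, payload)
print(f"{ratio:.2f}x")
```

Taking the minimum over repeats reduces noise from other processes. For a fair comparison, each parser should receive the same data in its own input format.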
Impact on Large Language Models
JTON's main effect on LLM workloads is lower token usage for structured data. Fewer tokens reduce cost and leave more of the context window available, which matters most in data-heavy applications. The comprehension results suggest that this saving comes without an average accuracy penalty (a net +0.3 percentage points across 10 models).
By removing the repeated key names that standard JSON carries in every row, JTON lets the same tabular data reach the model with less token overhead.
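To make the reported percentages concrete, the following Python sketch applies them to a prompt budget. The 10,000-token payload is an invented example figure. The reduction rates (15%, 28.5%, 60%) are the ones quoted above, and `tokens_after` is a hypothetical helper name.

```python
def tokens_after(json_tokens, reduction):
    """Tokens remaining after applying a fractional reduction (0.285 = 28.5%)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return json_tokens - round(json_tokens * reduction)

json_tokens = 10_000  # assumed size of a tabular payload in standard JSON

# Low end, average, and high end of the reported range.
for label, reduction in [("low", 0.15), ("average", 0.285), ("high", 0.60)]:
    remaining = tokens_after(json_tokens, reduction)
    print(f"{label:>7}: {remaining} tokens, {json_tokens - remaining} saved")
```

Because the repeated key names are a per-row cost, the absolute saving grows with row count, while the percentage depends on how long the keys are relative to the values.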
Availability and Future Work
All code, a comprehensive 683-vector test suite, and the experimental data are publicly available. The authors encourage developers and researchers to try JTON in their projects and contribute to its development.
Future work may explore additional optimizations and adaptations of JTON for various applications beyond structured data processing, potentially expanding its utility across diverse domains in natural language processing.
Conclusion
JTON offers a more compact way to serialize structured data for large language models. Its Zen Grid encoding improves token efficiency on tabular arrays while preserving JSON’s type system, and the reported results show comprehension accuracy roughly on par with standard JSON.
