TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
Large language models (LLMs) are increasingly used in scientific workflows to draft text, analyze data, and create figures. A persistent challenge is generating high-quality figures from textual descriptions, particularly when the figures are expressed as TikZ programs that are rendered into scientific images. A recent paper, arXiv:2603.03072v2, introduces TikZilla, an approach that addresses the limitations of existing datasets and modeling techniques for Text-to-TikZ tasks.
Challenges in Existing Approaches
Prior work on Text-to-TikZ generation has proposed various datasets and modeling strategies. However, existing datasets are often too small and noisy to capture the complexity of TikZ, which results in mismatches between the textual descriptions and the rendered figures. In addition, earlier methods have relied predominantly on supervised fine-tuning (SFT) alone, which does not expose the model to the rendered semantics of the figures. This leads to errors such as:
- Looping issues in generated figures
- Inclusion of irrelevant content
- Incorrect spatial relations between elements
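The first of these failure modes can be checked cheaply before rendering. The sketch below is a heuristic and not the paper's method: it assumes that "looping" shows up as the same line of TikZ code being emitted repeatedly. The function names and the 0.5 threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def repeated_line_ratio(tikz_code: str) -> float:
    """Return the fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip() for ln in tikz_code.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(lines)


def looks_like_looping(tikz_code: str, threshold: float = 0.5) -> bool:
    """Flag output whose duplicate-line ratio reaches the (assumed) threshold."""
    return repeated_line_ratio(tikz_code) >= threshold
```

A legitimate figure can contain some identical lines, so a check like this is better suited to flagging samples for inspection than to rejecting them outright.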
Introducing DaTikZ-V4 Dataset
To overcome these challenges, the authors introduce the DaTikZ-V4 dataset, which is more than four times larger than its predecessor, DaTikZ-V3, and significantly higher in quality. The dataset is enriched with LLM-generated figure descriptions, which gives a more robust foundation for training and is intended to improve the accuracy and fidelity of the generated TikZ figures.
Training the TikZilla Model
TikZilla is a family of small open-source Qwen models (3B and 8B) trained with a two-stage pipeline. The first stage uses supervised fine-tuning (SFT) to establish a baseline. The second stage refines the models with reinforcement learning (RL), using an image encoder trained via inverse graphics to provide semantically faithful reward signals.
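To make the idea of an image-based reward concrete, here is a minimal sketch. It assumes the reward is the cosine similarity between the embedding of the rendered generation and the embedding of the reference figure, and that a generation that fails to compile earns zero reward. This text does not specify the paper's actual reward formulation, so the function names, the use of cosine similarity, and the zero-reward rule are all assumptions; the embeddings are plain vectors standing in for the output of an image encoder.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def image_reward(
    generated: Optional[Sequence[float]],
    reference: Sequence[float],
) -> float:
    """Hypothetical reward in [0, 1] for one generated TikZ program.

    `generated` is the embedding of the rendered output, or None when the
    program did not compile (assumed here to earn zero reward).
    """
    if generated is None:
        return 0.0
    # Clamp negative similarities so the reward stays non-negative.
    return max(0.0, cosine_similarity(generated, reference))
```

Because the reward is computed on rendered images and not on code tokens, two programs that differ textually but draw the same figure would receive similar scores, which is the kind of signal SFT alone does not provide.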
Evaluation and Results
The authors assess TikZilla with an extensive human evaluation comprising over 1,000 judgments. TikZilla scores 1.5 to 2 points higher than its base models on a 5-point scale. It also surpasses GPT-4o by 0.5 points and matches GPT-5 in image-based evaluations, while using much smaller models.
Availability
The authors have committed to making the code, data, and models publicly available, which should support further research on Text-to-TikZ generation and on producing scientific figures from textual descriptions.
