CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation
Summary: arXiv:2604.10918v1 Announce Type: new
Abstract: Tables contain rich structured information, yet when stored as images their contents remain “locked” within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, style, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization.
We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across the components of a LaTeX table: structure, style, and content. In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.
Background
The digitization of tables plays a crucial role in data accessibility and usability within various fields, including academia, data science, and engineering. However, images of tables do not allow easy manipulation or analysis of their contents. LaTeX, a typesetting system commonly used for scientific documents, provides a robust framework for representing structured information such as tables. Thus, converting table images into LaTeX code is essential for unlocking this information.
The Challenge of Reward Ambiguity
Current approaches to training MLLMs for table-to-LaTeX generation are often hampered by reward ambiguity. It arises when multiple performance metrics are aggregated into a single reward signal, which obscures the distinct contributions of each aspect of table generation: structural integrity, stylistic consistency, and content accuracy. A single scalar cannot tell the model which aspect of its output earned or lost reward, so optimization becomes less effective.
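To make the ambiguity concrete, here is a minimal Python sketch in which two hypothetical outputs fail in opposite ways yet receive the same aggregated reward. The `ComponentScores` class, the unweighted mean in `aggregate`, and the score values are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    """Hypothetical per-component scores in [0, 1] for one generated table."""
    structure: float
    style: float
    content: float


def aggregate(scores: ComponentScores) -> float:
    """Collapse the three scores into one scalar (assumed: unweighted mean)."""
    return (scores.structure + scores.style + scores.content) / 3


# Output A: perfect layout, wrong cell text.
output_a = ComponentScores(structure=1.0, style=0.5, content=0.0)
# Output B: broken layout, correct cell text.
output_b = ComponentScores(structure=0.0, style=0.5, content=1.0)

# Both collapse to the same scalar, so a single-reward learner receives
# the same signal for two opposite failure modes.
print(aggregate(output_a), aggregate(output_b))
```

Both calls return 0.5, so the scalar alone gives no indication of whether the structure tokens or the content tokens need to change.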
Introducing Component-Specific Policy Optimization (CSPO)
CSPO addresses this problem by optimizing each component of the generated table separately instead of through one aggregated signal. The framework has three parts:
- Component-Specific Rewards: CSPO defines distinct rewards for structure, style, and content, allowing for a more nuanced evaluation of model performance.
- Targeted Backpropagation: CSPO backpropagates each reward signal only through the tokens relevant to its component, so a reward for one component does not update tokens that belong to another.
- Hierarchical Evaluation Metrics: The authors introduce a set of hierarchical evaluation metrics to assess performance comprehensively across the components.
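The first two parts of the framework can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the abstract does not say which RL algorithm CSPO builds on or how tokens are assigned to components, so the sketch uses a simple REINFORCE-style surrogate loss and takes hand-written component labels as input. The names `component_masks`, `per_token_advantages`, and `masked_policy_loss` are hypothetical, and plain floats stand in for the differentiable log-probabilities a real training loop would use.

```python
from typing import Dict, List

COMPONENTS = ("structure", "style", "content")


def component_masks(labels: List[str]) -> Dict[str, List[float]]:
    """Build one 0/1 mask per component over the token sequence."""
    return {c: [1.0 if lab == c else 0.0 for lab in labels] for c in COMPONENTS}


def per_token_advantages(labels: List[str],
                         advantages: Dict[str, float]) -> List[float]:
    """Give each token only the advantage of the component it belongs to."""
    masks = component_masks(labels)
    return [
        sum(advantages[c] * masks[c][t] for c in COMPONENTS)
        for t in range(len(labels))
    ]


def masked_policy_loss(log_probs: List[float],
                       labels: List[str],
                       advantages: Dict[str, float]) -> float:
    """Assumed surrogate: -(1/T) * sum_t A[component(t)] * log pi(y_t)."""
    token_adv = per_token_advantages(labels, advantages)
    return -sum(a * lp for a, lp in zip(token_adv, log_probs)) / len(log_probs)


# Hypothetical labelled tokens from one table row (labels assigned by hand).
tokens = [r"\hline", "Name", "&", "Score", r"\\"]
labels = ["style", "content", "structure", "content", "structure"]

# Hypothetical per-component advantages for this sample.
advantages = {"structure": 1.0, "style": 0.0, "content": -0.5}

print(per_token_advantages(labels, advantages))
print(masked_policy_loss([-0.1, -0.2, -0.3, -0.4, -0.5], labels, advantages))
```

In this sketch the content advantage reaches only `Name` and `Score`, the structure advantage reaches only `&` and `\\`, and the style token contributes nothing because its advantage is zero. A single aggregated reward would instead apply the same weight to all five tokens.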
Experimental Results
According to the abstract, extensive experiments demonstrate the effectiveness of CSPO for table-to-LaTeX generation and underscore the importance of component-specific optimization for reliable structured generation. The abstract reports no numerical results, so the margin over RL training with a single aggregated reward is not quantified here.
Conclusion
CSPO targets a specific weakness of RL post-training for table-to-LaTeX generation: a single aggregated reward cannot indicate which part of an output was right or wrong. By assigning separate rewards to structure, style, and content and routing each one only to the relevant tokens, the framework aims to make structured generation more reliable, which in turn supports data reuse and accessibility.
