Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
Abstract: Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation.
This article describes how we fine-tune DeepSeek-OCR-2, a model originally designed for Optical Character Recognition (OCR), to recognize molecular structures. The increasing volume of chemical literature available in printed formats makes effective OCSR necessary: structure diagrams must be converted into digital formats before the data can be manipulated and analyzed.
Methodology
Because direct full-parameter supervised fine-tuning of DeepSeek-OCR-2 often fails on this task, we use a two-stage progressive supervised fine-tuning strategy:
- Stage One: We start with parameter-efficient Low-Rank Adaptation (LoRA), which adapts the model at low computational cost.
- Stage Two: We transition to selective full-parameter fine-tuning with split learning rates, so that different parts of the model are updated at different rates.
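The two stages can be illustrated with a small sketch. It is not the training code used for MolSeek-OCR. The layer size, LoRA rank, parameter names, learning rates, and the choice to split by a "vision." prefix are all assumptions made for illustration, since the article does not say which parts of the model receive which rate. The first function counts the weights LoRA trains for one matrix. The second builds per-group settings in the list-of-dicts layout that PyTorch optimizers accept, using names instead of tensors so that it runs without a deep learning framework.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters trained by LoRA for one d_out x d_in weight matrix.

    LoRA freezes the original weight W and learns a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def split_lr_groups(param_names, prefix, prefix_lr, other_lr):
    """Split parameter names into two learning-rate groups.

    Names starting with `prefix` get `prefix_lr`; all others get `other_lr`.
    """
    matched = [n for n in param_names if n.startswith(prefix)]
    rest = [n for n in param_names if not n.startswith(prefix)]
    return [
        {"params": matched, "lr": prefix_lr},
        {"params": rest, "lr": other_lr},
    ]


# Stage One: a hypothetical 4096 x 4096 projection adapted with rank 16.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 16)
print(f"LoRA trains {lora} of {full} weights ({lora / full:.2%})")

# Stage Two: hypothetical parameter names and learning rates.
names = [
    "vision.patch_embed.weight",
    "vision.blocks.0.attn.weight",
    "decoder.layers.0.mlp.weight",
    "decoder.lm_head.weight",
]
groups = split_lr_groups(names, "vision.", prefix_lr=1e-6, other_lr=1e-5)
for group in groups:
    print(group["lr"], group["params"])
```

In a real PyTorch setup, each "params" entry would hold the parameter tensors themselves, and the list would be passed to an optimizer such as `torch.optim.AdamW`.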
We train on a large-scale corpus that combines synthetic renderings from PubChem with realistic patent images from USPTO-MOL. Drawing on both sources broadens the model’s coverage and improves its robustness, helping it generalize across different molecular drawing styles.
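One simple way to combine two sources is to draw each training example from one of them with a fixed probability. The sketch below assumes that approach. The article does not state the mixing ratio or the sampling scheme, so `real_fraction`, the file paths, and the (image path, SMILES) tuple layout are all hypothetical.

```python
import random


def mixed_samples(synthetic, real, real_fraction, n, seed=0):
    """Draw n examples, each taken from `real` with probability real_fraction.

    `synthetic` and `real` are lists of (image_path, smiles) tuples.
    Sampling is with replacement and is reproducible for a given seed.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        source = real if rng.random() < real_fraction else synthetic
        batch.append(rng.choice(source))
    return batch


# Hypothetical entries standing in for PubChem renderings and patent images.
synthetic = [("pubchem/0001.png", "CCO"), ("pubchem/0002.png", "c1ccccc1")]
real = [("uspto/0001.png", "CC(=O)O")]

for path, smiles in mixed_samples(synthetic, real, real_fraction=0.3, n=5):
    print(path, smiles)
```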
Results and Performance
The fine-tuned model, which we call MolSeek-OCR, is competitive on molecular structure recognition tasks. Its exact-match accuracy is comparable to that of some of the best-performing image-to-sequence models currently available.
However, MolSeek-OCR still falls short of state-of-the-art image-to-graph models. This gap shows that Optical Chemical Structure Recognition remains a hard problem and that further advances are needed.
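Exact-match accuracy is the fraction of predictions that equal their reference SMILES string. The sketch below is a minimal version of that metric and not the evaluation code behind the reported results. The article does not say whether strings are canonicalized before comparison, so the optional `canonicalize` hook is an assumption. In practice it could wrap a cheminformatics toolkit such as RDKit.

```python
def exact_match_accuracy(predictions, references, canonicalize=None):
    """Fraction of predictions equal to their reference after normalization.

    `canonicalize` is an optional function mapping a SMILES string to a
    canonical form. Without it, only surrounding whitespace is stripped.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    normalize = canonicalize if canonicalize is not None else str.strip
    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# "OCC" and "CCO" both denote ethanol but differ as strings, so plain
# string comparison counts the third prediction as a miss.
predictions = ["CCO", "c1ccccc1 ", "OCC"]
references = ["CCO", "c1ccccc1", "CCO"]
print(exact_match_accuracy(predictions, references))
```

The third pair shows why the choice of normalization matters: without canonicalization, a prediction that describes the right molecule with a different atom ordering is scored as wrong.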
Future Directions
Beyond the main model, we also explored reinforcement-style post-training and data-curation-based refinement. Neither improved the strict sequence-level fidelity that exact SMILES matching requires. Future work will look for more effective strategies to improve the fidelity of molecular recognition.
In conclusion, adapting DeepSeek-OCR-2 shows that optical recognition systems can be improved for the chemistry domain, which should make it easier to extract data from printed chemical literature.
