SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models
arXiv:2604.12377v1
Abstract
Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters.
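The compositional structure described above can be made concrete with the arithmetic that the Unicode standard defines for precomposed Hangul syllables. The sketch below is illustrative only and is not taken from the SCRIPT code base; the function names `decompose` and `to_jamo` are hypothetical.

```python
# Constants from the Unicode Hangul syllable composition scheme.
S_BASE = 0xAC00                          # first precomposed syllable, "가"
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28                # vowels; finals (index 0 = no final)
S_COUNT = 19 * V_COUNT * T_COUNT         # 11172 precomposed syllables


def decompose(ch):
    """Split one precomposed Hangul syllable into its conjoining Jamo.

    Any other character is returned unchanged as a 1-tuple.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return (ch,)
    lead = s_index // (V_COUNT * T_COUNT)
    vowel = (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    tail = s_index % T_COUNT
    jamo = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:                             # tail index 0 means no final consonant
        jamo.append(chr(T_BASE + tail))
    return tuple(jamo)


def to_jamo(text):
    """Flatten a string into a list of Jamo (hypothetical helper)."""
    return [j for ch in text for j in decompose(ch)]


# "한" (U+D55C) splits into U+1112, U+1161, U+11AB (initial, vowel, final).
print([hex(ord(j)) for j in decompose("한")])
print(len(to_jamo("한글")))
```

A subword tokenizer treats "한" as an opaque symbol, or as part of a larger one, so the three units recovered here are exactly the information it never sees directly.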
Introduction
To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean pre-trained language models (PLMs). SCRIPT enhances subword embeddings with structural granularity, allowing for deeper linguistic understanding without necessitating architectural changes or additional pre-training.
Key Features of SCRIPT
- Subcharacter Injection: SCRIPT introduces subcharacter-level information into existing models, helping to capture the inherent structure of Korean characters.
- No Architectural Changes: The module can be integrated into current language models without requiring modifications to their underlying architecture.
- Performance Enhancement: SCRIPT has been shown to improve performance in various Korean natural language understanding (NLU) and generation (NLG) tasks.
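As a rough illustration of what injecting subcharacter information into a subword embedding could look like, the toy sketch below mean-pools per-Jamo vectors and adds the result to an existing subword vector. This is an assumption-laden sketch, not the architecture from the paper: the class `JamoInjector`, the mean pooling, the additive fusion, and the random vectors standing in for learned parameters are all hypothetical choices made for the example.

```python
import random
import unicodedata

DIM = 8                      # toy embedding width (assumption)
rng = random.Random(0)       # fixed seed so the sketch is reproducible


class JamoInjector:
    """Hypothetical module: adds pooled Jamo information to a subword vector."""

    def __init__(self):
        self.jamo_table = {}  # Jamo -> vector; stands in for learned weights

    def jamo_vector(self, jamo):
        # Lazily create a small random vector for each unseen Jamo.
        if jamo not in self.jamo_table:
            self.jamo_table[jamo] = [rng.uniform(-0.1, 0.1) for _ in range(DIM)]
        return self.jamo_table[jamo]

    def inject(self, subword, subword_embedding):
        # NFD normalization splits Hangul syllables into conjoining Jamo.
        jamos = list(unicodedata.normalize("NFD", subword))
        if not jamos:
            return list(subword_embedding)
        vectors = [self.jamo_vector(j) for j in jamos]
        pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(DIM)]
        # Additive fusion keeps the output the same shape as the input,
        # so the host model's layers would not need to change.
        return [e + p for e, p in zip(subword_embedding, pooled)]


injector = JamoInjector()
base = [0.0] * DIM           # placeholder for a PLM's subword embedding
enriched = injector.inject("먹", base)
print(len(enriched), len(injector.jamo_table))
```

Because the output has the same dimensionality as the input embedding, a module of this shape could sit between the embedding lookup and the first layer of an existing model, which is consistent with the "no architectural changes" property listed above. Subwords that share Jamo, such as "먹" and "멍", would also share part of their injected signal.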
Performance Results
In empirical evaluations, SCRIPT consistently improved over baseline models across multiple tasks. The improvement is notable in areas such as sentiment analysis, machine translation, and text summarization. The added subcharacter-level information gives models access to structure within Korean characters that subword tokenization alone does not expose.
Linguistic Analysis
Beyond performance improvements, a detailed linguistic analysis shows that SCRIPT reshapes the embedding space of language models so that it captures grammatical regularities and semantically cohesive variations more effectively. These findings are relevant to researchers and developers working on Korean language processing.
Conclusion
SCRIPT bridges the gap between subword tokenization and the rich morphological structure of Korean. It improves model performance and also offers linguistic insight into how subcharacter information shapes learned representations. The code is available at https://github.com/SungHo3268/SCRIPT.
Future Work
Looking ahead, further studies will explore applying SCRIPT to languages from other families and in multilingual settings. Ongoing research will also focus on refining the module for a broader range of linguistic applications.
